Comparative analysis of multiple target tracking methods

In computer vision, multi-target tracking plays a vital role in detecting and tracking targets while preserving their identities. Accurate target tracking is key to several applications such as video surveillance, motion and pattern analysis, and pedestrian tracking. This work concentrates on tracking humans in video sequences. Multi-target tracking has seen many developments in recent years; however, in real-time situations, accuracy and precision remain a challenge for state-of-the-art algorithms. In MTT (Multiple Target Tracking), the first step is target detection, which comprises segmentation and foreground and background extraction. The trajectories are estimated at a later stage. In single target tracking, the target is tracked within a specified area, and a possible trajectory is obtained by joining the locations through which the target has moved over time. Similarly, in multiple target tracking, a larger number of targets is observed simultaneously. The difficult process of correctly matching the identity of each target to its corresponding detection is known as data association. Multi-target tracking also faces many challenges such as changes in illumination, scale variations, out-of-plane rotation, severe occlusions, similar appearance of targets and interaction among multiple targets. In this work, various methods are proposed to solve these issues. The solutions are discussed with respect to different aspects of the MTT system. The methods are compared on various datasets. Their performance is evaluated quantitatively using the MTT metrics, and experimental comparisons among state-of-the-art methods are presented.


I. Introduction
In computer vision, multi-target tracking plays a vital role in detecting and tracking targets while preserving their identities. Accurate target tracking is key to several applications such as video surveillance, motion and pattern analysis, and pedestrian tracking. This work concentrates on tracking humans in video sequences. Multi-target tracking has seen many developments in recent years; however, in real-time situations, accuracy and precision remain a challenge for state-of-the-art algorithms. In MTT (Multiple Target Tracking), the first step is target detection, which comprises segmentation and foreground and background extraction. The trajectories are estimated at a later stage. In single target tracking, the target is tracked within a specified area, and a possible trajectory is obtained by joining the locations through which the target has moved over time. Similarly, in multiple target tracking, a larger number of targets is observed simultaneously. The difficult process of correctly matching the identity of each target to its corresponding detection is known as data association. Multi-target tracking also faces many challenges such as changes in illumination, scale variations, out-of-plane rotation, severe occlusions, similar appearance of targets and interaction among multiple targets. In this work, various methods are proposed to solve these issues. The solutions are discussed with respect to different aspects of the MTT system. The methods are compared on various datasets. Their performance is evaluated quantitatively using the MTT metrics, and experimental comparisons among state-of-the-art methods are presented.

II. Related Work
The literature shows that multiple target tracking has been an active area of research in computer vision for more than a decade. This section discusses the existing work on multiple target tracking. Initially, several algorithms [1] [2] used a recursive approach for tracking the targets. These methods use the Kalman filter, in which the present state is updated based on information from the previous frame. In the sequential Monte Carlo sampling method, the distribution consists of weighted particles which specify the current and hidden state information [3][4]. This can handle non-linear and multi-modal cases. The process works well for a small number of targets with a small sample size. In practice, when the number of targets increases, a reliable representation of the targets becomes difficult because a large number of samples is required to handle the data association. However, this can be done with the help of the Markov chain Monte Carlo method [5] or probabilistic filtering. During the last few years, non-recursive methods have been used increasingly. These approaches estimate the trajectory within a time window. The computation of the trajectory can be constrained in one of two ways: by restricting the locations to a regular discrete grid, or by applying non-maxima suppression for target detection. The solution space is thereby limited to a finite set of states.
Michael et al. [6] designed an algorithm to detect the number of active blobs in a video sequence and discussed the speed efficiency of the application. However, it does not emphasize tracking accuracy and precision. Leibe et al. [7] introduced a heuristic approach to handle the local optimality of the target; the preceding tasks of target detection and trajectory estimation are carried out by a quadratic binary technique. Michael et al. [8,9] introduced a Gaussian mixture model based on likelihood matching in order to track multiple targets in video sequences. This includes active background extraction, segmentation and likelihood matching to distinguish the affinity between the targets. Furthermore, the cost optimization is posed as an assignment problem. In addition, the Kalman filter is applied to track the target precisely. Jiang et al. [10] designed a framework which formulates multi-target tracking as an integer linear program with a few constraints imposed on layouts. Furthermore, LP relaxation is applied to attain globally optimal solutions, but it is often unsuccessful. Whenever a target passes through an occlusion, a special node is created to handle the occlusion. There are some serious limitations in this approach: the number of targets has to be specified in advance, and the number of occluded targets has to be defined to avoid collisions. Exact localization of the occluded target is important to achieve high tracking accuracy. Michael et al. [11,12] proposed an optimization of multi-target tracking and an occlusion handling technique using the mean shift method. This focuses on minimizing the cost of target localization and on similarity matching between the candidate target and the actual targets using the colour feature. Furthermore, it discusses the optimization strategy and the energy minimization procedures.
According to Berclaz et al. [13], the tracking region is partitioned into disjoint cells, and the concept of a virtual location is introduced, which can generate new trajectories and absorb existing trajectories at certain locations. The resulting integer linear program is in turn fed into the K-shortest paths or LP-relaxation algorithm to speed up the computation. The framework also includes similarity matching, thereby lowering the count of identity switches between the targets. Rodriguez et al. [14] employed head-based tracking in a highly crowded region. A binary energy minimization function with certain constraint terms is used to determine the exact count of the targets. Here the camera is set up at a high viewpoint, which suits surveillance; however, this is not viable for other applications such as intelligent transportation and entertainment. Xing et al. [15] designed a framework that addresses gaps occurring in the trajectory due to occlusion. A long trajectory is built by connecting the tracklets with the short tracklets generated without occlusions. In a crowded environment, many serious problems may occur because of the large variation of dynamic targets and repeated occlusions. This type of target tracking is rarely attempted due to its inherent difficulties. However, Kratz and Nishino [16] used a spatio-temporal method to study the motion patterns in a crowded environment. The target likelihood is calculated by converting the distance between colour histograms into a probability using the exponential function. Choi and Savarese [17] employed a mean shift tracker which uses the colour histogram to detect the target sequentially.
According to Qin and Shelton [18], the appearance model is initialized as colour histograms, and the mean weights of all the detection responses are then developed into trajectories. The Bhattacharyya coefficient is used to calculate the likelihood of two targets represented as colour histograms. The similarity of appearance is measured as the Bhattacharyya distance between the tracklets and the mean HSV colour histogram. Zhang et al. [19] used an appearance model based on RGB colour histograms to calculate the link affinity between detection responses. The similarities between the same target and between different targets are modelled by Gaussian distributions. However, the colour histogram cannot represent spatial information. Henriques et al. [20] employed a covariance matrix descriptor to represent the target appearance model. The likelihood matching is performed by linking Gaussian distributions of the detection responses. Breitenstein et al. [21] represented the 2D image speed and positions using a motion model with constant velocity. Here the previous states of the object are considered in order to reduce the noise, which is characterised by the mean and variance of a Gaussian. Andriyenko and Schindler [22] and Milan et al. [23] also used a constant velocity model in which energy terms such as trajectory persistence, mutual exclusion, fragmentation and occlusion handling are summed to obtain a cost minimization for multiple object tracking. According to Bae et al. [24], online multi-object tracking problems are solved by associating tracklets according to their confidence, and fragmented tracklets are linked without iterations. However, the association of tracklets in complex scenarios remains unsolved. Wu et al. [25] employed a four-part body detector for tracking humans under inter-object and scene occlusions. de Villiers et al. [27] designed a mean shift tracker for handling objects under occlusion, but a few issues remain unsolved.

III. Multiple Tracking Methods
In MTT, computing the similarity between the appearances of targets plays an important role. It is important to note that single object tracking mainly focuses on discriminating the target from the active background. In real time, it is not easy to discriminate multiple targets; hence MTT needs additional appearance information to distinguish the multiple targets from the background.

A. Gaussian Mixture Model Based Beta-likelihood Matching and Kalman Filter (GMM-BLM-KF)
The GMM-BLM-KF [8] uses a parametric probability density model consisting of Gaussian components, developed from the basic concept of the adaptive background model [26]. The multimodal density of the targets is obtained by combining these component functions. Colour is one of the most commonly used features in tracking. The colour components can be used in real-time applications for colour-based MTT and segmentation. In order to develop a robust model, a component mixture model of the background colour with respect to the foreground model is generated. The pixel classification is accomplished with Bayes' theorem. The Gaussian mixture model is adaptable and effective for online target tracking, and it is also suitable for slowly changing illumination conditions.
A Gaussian mixture model is defined as the weighted sum of M Gaussian components, as given in (1):

$$p(x \mid \lambda) = \sum_{i=1}^{M} w_i \, g(x \mid \mu_i, \Sigma_i) \qquad (1)$$

Here x represents the D-dimensional data, $w_i$ are the mixture weights with i ranging from 1 to M, $\mu_i$ represents the mean vector and $\Sigma_i$ represents the covariance matrix of the i-th component density $g(x \mid \mu_i, \Sigma_i)$.
The mixture weights satisfy $\sum_{i=1}^{M} w_i = 1$, which is the constraint on the Gaussian mixture components.
The parameters of the Gaussian mixture model comprise the mixture weights, mean vectors and covariance matrices of the component densities. This is represented as $\lambda = \{w_i, \mu_i, \Sigma_i\}$, $i = 1, \ldots, M$. The covariance matrix $\Sigma_i$ can be restricted to be diagonal. The form of the covariance matrix, the number of components, and the type of parameters are decided by the quantity of data available for estimating the GMM parameters.
On the whole, the feature density is formed by the Gaussian components acting together, and the correlation between the elements of the feature vector can be captured by combining components with diagonal covariance matrices.
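The weighted-sum definition can be illustrated with a short sketch. It assumes diagonal covariance matrices, so each component is described by a list of per-dimension variances; the function names `gaussian_diag` and `gmm_density` are hypothetical and are not taken from [8].

```python
import math

def gaussian_diag(x, mean, var):
    """Density of a D-dimensional Gaussian with diagonal covariance.

    x, mean and var are sequences of length D; var holds the diagonal
    entries of the covariance matrix (assumption: no off-diagonal terms).
    """
    log_p = 0.0
    for xd, md, vd in zip(x, mean, var):
        log_p += -0.5 * (math.log(2.0 * math.pi * vd) + (xd - md) ** 2 / vd)
    return math.exp(log_p)

def gmm_density(x, weights, means, variances):
    """Weighted sum of M Gaussian component densities."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("mixture weights must sum to 1")
    return sum(w * gaussian_diag(x, m, v)
               for w, m, v in zip(weights, means, variances))
```

For example, `gmm_density([0.0], [0.5, 0.5], [[-1.0], [1.0]], [[1.0], [1.0]])` evaluates a symmetric two-component mixture at its midpoint.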

1) Background Extraction
The Gaussians of the mixture model used for background and foreground extraction are determined from the changes in the pixel distributions. The background Gaussian distributions are those with the least variance and the most supporting evidence. This can be reasoned as follows. Firstly, when a target is visible and persists until the end, supporting evidence accumulates and the variance stays low for the background distributions. Secondly, when a new target occludes the background target, a new distribution is created or the variance of an existing distribution increases. Thirdly, a moving target has a higher variance in its distribution. Fig. 1 shows the tracking results of the method.
Here, a conventional technique is required to select the Gaussians that best model the background. Firstly, the Gaussians are ordered according to the value of $w/\sigma$, the ratio of the mixture weight to the standard deviation. Secondly, the parameters are re-estimated and the distributions are re-sorted, so that the most probable distributions remain at the top. Next, the first B distributions are chosen as the background model, where T represents the minimum portion of the data that should be accounted for by the background. For a single-modal distribution, T is assumed to be small. The T value is higher for a multi-modal distribution due to the frequent changes in the background.
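The selection of the B background distributions can be sketched as follows. The sketch assumes that the ordering key is the ratio of mixture weight to standard deviation, as is common in adaptive mixture background models, and that B is the smallest number of top-ranked components whose cumulative weight exceeds T; the function name `select_background` is hypothetical.

```python
def select_background(weights, sigmas, T):
    """Return the indices of the components treated as background.

    Components are ranked by weight / sigma (assumed ordering key), then
    taken from the top until their cumulative weight exceeds T.
    """
    order = sorted(range(len(weights)),
                   key=lambda k: weights[k] / sigmas[k], reverse=True)
    chosen, cumulative = [], 0.0
    for k in order:
        chosen.append(k)
        cumulative += weights[k]
        if cumulative > T:
            break
    return chosen
```

With a small T only the dominant component is kept, which matches the single-modal case; a larger T admits further components for multi-modal backgrounds.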

2) Data Association and Likelihood Matching
In multi-target tracking, data association is an important and fundamental task. It is the process of associating uncertain measurements with known tracks. A probability distribution over the state vectors of the multiple targets is generated, and a weight is associated with each state. The target area of interest is obtained as the weighted sum of the distributions. Furthermore, the likelihood of a target is measured using a Gaussian density function.

3) Kalman Filtering
Kalman filtering is applied to the Gaussian distributed output to estimate the state of the linear system. This improves the efficiency of the computation and yields an optimal solution. Tracking the target from one frame to the next in the video sequence allows the next position of the target to be predicted from its previous trajectory. The Kalman filter for a Gaussian system is modelled to handle the target changes in consecutive frames as

$$X_t = A X_{t-1} + S_t$$

where X represents the state vector $[x, y, u, v, \omega, \Delta t]^T$ and the state transition matrix is

$$A = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 1 & 1 \\ 0 & 0 & 0 & 0 & 0 & 1 \end{bmatrix}$$

$S_t$ denotes the random vector modelling the uncertainty of the model.
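The prediction step of such a filter can be sketched in a few lines. The sketch covers only the noise-free state propagation, not the measurement update, and it assumes a 6 × 6 constant-velocity transition pattern for the state layout [x, y, u, v, ω, Δt]; the names `A` and `predict` are illustrative and not taken from [8].

```python
# Assumed state layout from the text: [x, y, u, v, omega, dt].
# Assumed transition matrix: position advances by velocity each frame,
# velocity is kept constant (hypothetical reconstruction).
A = [
    [1, 0, 1, 0, 0, 0],
    [0, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 1],
]

def predict(state, transition=A):
    """Propagate the state one frame ahead (noise term omitted)."""
    return [sum(a * s for a, s in zip(row, state)) for row in transition]
```

Calling `predict` repeatedly extrapolates the trajectory, which is how a track can be carried through frames in which the detection is missing.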

B. Global Energy Minimization and Optimization Technique (GEM-OT)
This recent work [11] discusses global energy minimization and optimization techniques designed for efficient tracking, since tracking multiple targets remains a challenging task in computer vision. Many applications such as video surveillance, pattern matching, intelligent systems and robotics cannot succeed unless the targets are tracked accurately. The energy terms are determined by the motion of the targets in each frame, their locations, missing image evidence, and constraints such as the smoothness of target motion and mutual exclusion between targets. The energy terms are formulated to develop an efficient and globally optimal solution for the multi-target tracking system.

1) Energy Function
Energy minimization is one of the most important tasks in tracking multiple targets. The general objective of such methods is to provide a feasible solution with a low cost. In order to express multi-target tracking accurately, the energy terms are developed in closed form so that an efficient gradient-based optimization can be applied. The energy terms are linearly combined to form the energy function. Here d characterises the search space; its value varies with the trajectory length and the target count. X denotes the world coordinates of all targets in all frames.
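One way to write this linear combination is sketched here. The term names (observation, appearance, dynamics, exclusion, persistence, regularization) and the pairing of the weights $k_1, \ldots, k_5$ with individual terms are assumptions made for illustration; they follow the subsections of this method but are not confirmed by [11].

```latex
% Hypothetical notation: all subscripts and the weight assignment are assumed.
E(X) = E_{\mathrm{obs}}(X) + k_1 E_{\mathrm{app}}(X) + k_2 E_{\mathrm{dyn}}(X)
     + k_3 E_{\mathrm{exc}}(X) + k_4 E_{\mathrm{per}}(X) + k_5 E_{\mathrm{reg}}(X),
\qquad
X^{*} = \arg\min_{X \in \mathbb{R}^{d}} E(X)
```

Under this reading, the tracker searches for the state $X^{*}$ with the lowest total energy over the d-dimensional search space.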

2) Target Tracking
The popular tracking-by-detection approach is applied to track pedestrians in video sequences. A sliding-window SVM detector is used to locate the pedestrians. Histograms of oriented gradients are used in the detector to extract the pedestrian features. Non-maxima suppression detects the peaks, which serve as image evidence and are expressed in a global coordinate system for tracking. The main objective of the data term (9) is to keep the target trajectories close to the observations. The energy increases in proportion to the difference between the calculated target location and the detection, and each detection carries a weight.
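The peak selection step can be illustrated with a greedy non-maxima suppression over detection boxes. This is a generic sketch, not the detector pipeline of [11]: the box format `(x1, y1, x2, y2)`, the overlap threshold of 0.5 and the function names are all assumptions.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def non_maxima_suppression(boxes, scores, iou_threshold=0.5):
    """Keep the highest-scoring boxes, dropping overlapping weaker ones.

    Returns the indices of the kept boxes in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in kept):
            kept.append(i)
    return kept
```

Each kept index corresponds to one detection peak; the detector score can then serve as the weight of that detection in the data term.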

3) Operational representation
The relative difference in target movement and the low frame rate can be handled by introducing a constant velocity model, which reduces the difference between consecutive velocity vectors.
This model helps to reduce identity switches and favours straight paths. It also smooths most of the misaligned detections. The smooth target trajectory produced is known as intelligent smoothing.
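A constant velocity term of this kind can be sketched as the sum of squared differences between consecutive velocity vectors of one trajectory. The exact form used in [11] is not given in the text, so the squared-norm penalty and the name `dynamic_energy` are assumptions.

```python
def dynamic_energy(trajectory):
    """Sum of squared differences between consecutive velocity vectors.

    trajectory is a list of (x, y) positions, one per frame. A target
    moving on a straight line at constant speed yields zero energy.
    """
    energy = 0.0
    for t in range(len(trajectory) - 2):
        (x0, y0), (x1, y1), (x2, y2) = trajectory[t:t + 3]
        ax = (x2 - x1) - (x1 - x0)  # change in x-velocity
        ay = (y2 - y1) - (y1 - y0)  # change in y-velocity
        energy += ax * ax + ay * ay
    return energy
```

Because abrupt changes of direction are penalized, a minimizer prefers to continue an existing straight track rather than jump to a neighbouring target, which is the mechanism behind the reduced identity switching.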

4) Collision Handling
Another important issue in tracking multiple targets is collision. This model handles collisions by applying a penalty when two or more targets come close to one another.
The model can handle difficult cases of collision avoidance between targets. The continuous terms of the operational model and the collision avoidance model help to achieve data association indirectly. They also provide an interpretation of the data, in addition to plausible trajectories and a pleasing visualization.
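A proximity penalty of this kind can be sketched as follows. The inverse squared distance form, the `scale` parameter and the name `exclusion_energy` are assumptions chosen for illustration; the text does not state the functional form used in [11].

```python
def exclusion_energy(positions, scale=1.0):
    """Penalty that grows as pairs of targets approach each other.

    positions is a list of (x, y) coordinates of all targets in one frame.
    Each pair contributes scale / squared_distance (assumed form).
    """
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            dx = positions[i][0] - positions[j][0]
            dy = positions[i][1] - positions[j][1]
            dist_sq = dx * dx + dy * dy
            energy += scale / max(dist_sq, 1e-9)  # guard against division by zero
    return energy
```

Since the penalty is continuous in the target positions, it can be minimized by the same gradient-based optimization as the other continuous terms.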

5) Trajectory Processing
Fragmentation and sudden termination of tracking occur when the target evidence is missing within the tracking area. Hence, it is advisable for trajectories to start and end only where the target reaches the frame border. Trajectories that do not abide by this rule are penalized. A sigmoid centred in the middle of the border region is used for this penalty.
w is the entry edge and is set to w = 1/r. S denotes the total number of frames. The starting and ending frames of the trajectory are denoted as u and v respectively.

6) Standardization
Lastly, to avoid fitting the data with an arbitrarily growing number of targets, standardization is introduced. It works by penalizing the number of existing targets. By combining the trajectory length with the standardization term, this model removes unwanted short tracks from the scene, thereby attaining better performance.

C. Mean Shift Target Tracking
Mean shift tracking [12] plays an important role in tracking targets under occlusion. It can handle different types of occlusions, scale variations and complex backgrounds. Similarity matching is formulated as finding the local minimum of the distance between the actual target and the reference target. It also uses various features in order to determine the target scale.

1) Similarity Model
The similarity measure between the target and the reference target is defined as a function. RGB colour features with 16 bins per channel are used; a large area is divided into a number of sub-areas, and each sub-area has its own histogram. The similarity measure is obtained by applying the Bhattacharyya distance between sub-areas in order to determine the relative closeness of the targets.
The definition is piecewise: one value applies if the pixel at location Q belongs to region $R_Q$, and another otherwise. The binning function is represented as H, and h is the histogram associated with the pixel location. $\delta$ is the Kronecker delta function. w is the probability of the feature in the target, with w = 1, ..., m. K is the kernel, and the weights depend on S, the size of sub-region R in pixels. $C_s$ is the normalization constant, s is the kernel bandwidth, and n is the normalized pixel location in the candidate region. The reference target and the candidate target are represented as q and p respectively.
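The histogram comparison can be sketched as follows. The sketch assumes a joint three-dimensional RGB histogram with 16 bins per channel and uses the common form of the Bhattacharyya distance, the square root of one minus the Bhattacharyya coefficient; the kernel weighting and the sub-area division of [12] are omitted, and the function names are hypothetical.

```python
import math

def rgb_histogram(pixels, bins=16):
    """Normalised joint RGB histogram; pixels are (r, g, b) in 0..255."""
    hist = [0.0] * (bins ** 3)
    step = 256 // bins
    for r, g, b in pixels:
        idx = (r // step) * bins * bins + (g // step) * bins + (b // step)
        hist[idx] += 1.0
    total = sum(hist)
    return [h / total for h in hist] if total else hist

def bhattacharyya_distance(p, q):
    """Distance between two normalised histograms p and q."""
    coeff = sum(math.sqrt(a * b) for a, b in zip(p, q))  # Bhattacharyya coefficient
    return math.sqrt(max(0.0, 1.0 - coeff))
```

A distance near 0 indicates that the candidate region closely matches the reference target, while a distance of 1 means the two histograms share no occupied bins.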

2) Color features
The mean shift algorithm uses the colour histogram feature to cope with occlusions, scale variations and similar problems. Difficulties arise when the target and background colours are similar. In this method, a three-dimensional colour histogram is used to distinguish the target affinity.

3) Energy Reduction
Gradient-based optimization is chosen to minimize the differentiable energy components in closed form. A set of six jump moves (growing, shrinking, adding, removing, merging and splitting) is processed iteratively with greedy parameter selection, and conjugate gradient descent is finally performed so that the individual targets are optimized independently upon convergence.

IV. Experimental Studies
The performance of the MTT algorithms is measured using common metrics and datasets. The algorithms are also compared with other methods in order to validate the proposed algorithms.

A. PETS2009
The state-of-the-art multiple target tracking methods are evaluated using publicly available datasets. Firstly, PETS 2009 consists of several video sequences with multiple views, whose lengths vary from 90 to 800 frames. The videos are captured using a well-calibrated camera with a pixel resolution of 768 × 576 at 7 fps. The number of individual targets per frame varies from 7 to 42. In this work, seven video sequences are chosen for testing. The first view of six video sequences (S1L1-1, S1L2-1, S2L2, S2L3) is captured in a crowded environment, which is specifically used for event recognition or density estimation, and the remaining two video sequences (S2L1 and S3MF1) are captured in a medium crowd. In addition to the above sequences, TUD-Stadtmitte is also employed. This video, with 179 frames, is captured from a low viewpoint in a very busy pedestrian street. This difficult video is used to estimate the precision of the tracker.

B. CAVIAR
The video sequences are classified into two sets: the first is captured using a single camera in a lobby, and the second in the corridor of a shopping mall. The ground truth is provided as a bounding box on every observed person. The dataset also includes some complex occlusions and background images with poor contrast. Fig. 2 shows the tracking results on the CAVIAR dataset. The resolution of the captured video is 384 × 288 pixels at 25 frames per second.

VI. Evaluation
The performance evaluation and comparative study of multi-target tracking is an important and challenging task. Different video sequences from the datasets are tested with the different methods in order to enable a fair comparison. The video sequences from the CAVIAR, PETS2009 and TUD-Stadtmitte datasets are used to inform training. The computation and measurement are performed using the CLEAR MOT metrics. The proposed GMM beta-likelihood and global optimization techniques depend on their parameters. The parameter values are set according to the requirements of the video sequences used, in order to achieve maximum MTTA and MTTP scores.
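The accuracy and precision scores can be sketched as follows, assuming that MTTA and MTTP correspond to the CLEAR MOT accuracy and precision scores. The per-frame input lists and the function names `mota` and `motp` are illustrative; the precision score is computed here as the mean match score over all matched pairs, whether that score is an overlap or a distance.

```python
def mota(false_negatives, false_positives, id_switches, ground_truth):
    """CLEAR MOT accuracy: 1 - (FN + FP + ID switches) / ground-truth count.

    Each argument is a list with one count per frame.
    """
    total_gt = sum(ground_truth)
    if total_gt == 0:
        raise ValueError("no ground-truth objects")
    errors = sum(false_negatives) + sum(false_positives) + sum(id_switches)
    return 1.0 - errors / total_gt

def motp(match_scores_per_frame):
    """Mean match score over all matched target-hypothesis pairs."""
    total = sum(sum(frame) for frame in match_scores_per_frame)
    count = sum(len(frame) for frame in match_scores_per_frame)
    if count == 0:
        raise ValueError("no matches")
    return total / count
```

Note that the accuracy score can become negative when the number of errors exceeds the number of ground-truth objects.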
Table 1 gives the results obtained by applying the proposed GMM method to the S2L1, S2L2 and St. George street crossing video sequences. Apart from pedestrians, the St. George street crossing video contains a varying background, swaying leaves, illumination changes and crossing vehicles. When the number of events is very large, the Gaussian distribution is applied to describe the physical events. The mean accuracy of the tracked targets is above 90% and the precision is 70.2%. Initially, a few iterations over the distributions are needed to distinguish the active Gaussian background from the foreground. Once the reference background is obtained, the targets are detected and tracked accurately. This type of method can be used for long-running applications. The cost optimization is achieved by implementing the Munkres algorithm. This algorithm models an assignment problem as an N × M cost matrix, where each element represents the cost of assigning the i-th frame to the j-th process, and it finds the least-cost solution by choosing a single item from each row and column of the matrix, such that no row and no column is used more than once. It runs in $O(n^3)$ time rather than $O(n!)$. It is also used to maximize the likelihood matching of the targets.
Table 2 gives the results obtained from the global energy minimization and optimization method. The metrics are applied to all video sequences individually. The performance of the tracker varies with the number of targets encountered in the frame. Video sequences such as PETS-S2L1 and TUD-Stadtmitte contain fewer than 10 targets per frame. In these scenarios, the MTTA is over 90% and 70% respectively, which indicates better performance because all the target pedestrians are visible all the time and there are fewer occlusions. However, the MTTA is reduced to 58% on PETS-S2L2 and S2L3, because these are challenging datasets with more than 40 targets appearing in the same frame with severe occlusions. The performance is measured based on the number of targets; in this case, the six datasets are divided into two groups: group one consists of the video sequences with fewer than 10 pedestrians, with an average accuracy of 81.95%, and group two consists of those with more than 40 targets, with a mean accuracy of 48.52%. The accuracy of the tracking result increases by 8% for the most difficult scene (PETS-S2L2), and the number of mostly tracked targets rises by approximately 37%; fewer than 10% of the target trajectories are missed due to occlusion, which is considered a good result in MTT.
Tables 3, 4 and 5 present the comparative results on the PETS2009-S2L1, PETS2009-S2L2 and CAVIAR video sequences against other methods. The proposed global optimization shows better performance than the other multiple target tracking methods in terms of accuracy, target identity switches, fragmentation and related measures. Here, the MTTA on S2L1 and S2L2 is 2% higher than that of the other methods, and it is more than 5% higher on the CAVIAR dataset, reflecting reductions in the false positive rate, the number of mismatches and the false negative rate. For the mostly tracked trajectories, over 90% is successfully covered by the tracker with respect to the ground truth, which is 10% higher than the assumed limit. The share of mostly lost trajectories is 10% to 12%, which is lower than expected and a good indication of accurate tracking. The number of identity switches between the targets is approximately 4%, and the rate of fragmentation is thereby reduced considerably.
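The assignment step can be illustrated with a compact $O(n^3)$ variant of the Hungarian (Munkres) method based on row and column potentials. This is a generic textbook sketch, not the implementation used for the reported results; it assumes a cost matrix with no more rows than columns, and the name `hungarian` is hypothetical.

```python
def hungarian(cost):
    """Minimum-cost assignment for an n x m cost matrix with n <= m.

    Returns (assignment, total), where assignment[i] is the column
    chosen for row i and total is the summed cost.
    """
    n, m = len(cost), len(cost[0])
    INF = float("inf")
    u = [0.0] * (n + 1)          # row potentials
    v = [0.0] * (m + 1)          # column potentials
    p = [0] * (m + 1)            # p[j]: row matched to column j (0 = none)
    way = [0] * (m + 1)          # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:                 # flip the matching along the path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    assignment = [-1] * n
    for j in range(1, m + 1):
        if p[j]:
            assignment[p[j] - 1] = j - 1
    total = sum(cost[i][assignment[i]] for i in range(n))
    return assignment, total
```

To maximize a likelihood instead of minimizing a cost, the matrix entries can be negated (or replaced by negative log-likelihoods) before calling the function.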

VII. Conclusion
The performance of all the tracking methods is discussed from various perspectives, such as occlusion of targets, varying numbers of targets, appearance modelling, camera viewpoints and affinity matching. The results of the various methods are discussed, and they indicate that the proposed Global Energy Minimization and Optimization Technique is ahead of the other trackers, because it localizes the targets precisely and considerably decreases the false positive count, the number of mismatches and the false negative rate, which are essential factors in comparing MTT applications. The experimental evaluation on various complex datasets gives better results than the other state-of-the-art methods.

Fig. 1. The tracking results of the GMM-BLM-KF method: Rows 1 and 2 show the tracking video sequences of daria walks and Row 3 shows the street crossing sequences.
In the energy function, one term denotes the affinity of the target appearance, three terms are data terms that include physical limitations to promote a smooth target trajectory, and one term represents the regularized solution. $k_1, \ldots, k_5$ are the parameter weights, which are set according to the type of video dataset used and depend strongly on the details of the implementation. The state X* is the one that minimizes the energy.

O is used to scale the unseen target, rg is the scalar, and T represents the total number of targets; a further term is the offset. x denotes the image coordinates of target i in frame t.

Fig. 2. The tracking video sequences from the CAVIAR dataset: Row 1 shows the tracking result of the office lobby video and Row 2 shows the tracking result of the shopping mall sequence.

Table 2. Results of GEM-OT on PETS2009 video sequences

Table 3. Comparative results on the PETS2009-S2L1 video sequence

Table 4. Comparative results on the PETS2009-S2L2 video sequence
Michael Kamaraj and Balakrishnan Ganesan (Comparative analysis of multiple target tracking methods)

Table 5. Comparative results on the CAVIAR video sequence