This article is part of the series Object Tracking and Monitoring Using Advanced Signal Processing Techniques.

Open Access Research

Multi-camera multi-object voxel-based Monte Carlo 3D tracking strategies

Cristian Canton-Ferrer*, Josep R Casas, Montse Pardàs and Enric Monte

Author Affiliations

Image and Video Processsing Group, Technical University of Catalonia, Barcelona, Spain

For all author emails, please log on.

EURASIP Journal on Advances in Signal Processing 2011, 2011:114  doi:10.1186/1687-6180-2011-114


The electronic version of this article is the complete one and can be found online at: http://asp.eurasipjournals.com/content/2011/1/114


Received:15 May 2011
Accepted:23 November 2011
Published:23 November 2011

© 2011 Canton-Ferrer et al; licensee Springer.

This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

This article presents a new approach to the problem of simultaneous tracking of several people in low-resolution sequences from multiple calibrated cameras. Redundancy among cameras is exploited to generate a discrete 3D colored representation of the scene, being the starting point of the processing chain. We review how the initiation and termination of tracks influences the overall tracker performance, and present a Bayesian approach to efficiently create and destroy tracks. Two Monte Carlo-based schemes adapted to the incoming 3D discrete data are introduced. First, a particle filtering technique is proposed relying on a volume likelihood function taking into account both occupancy and color information. Sparse sampling is presented as an alternative based on a sampling of the surface voxels in order to estimate the centroid of the tracked people. In this case, the likelihood function is based on local neighborhoods computations thus dramatically decreasing the computational load of the algorithm. A discrete 3D re-sampling procedure is introduced to drive these samples along time. Multiple targets are tracked by means of multiple filters, and interaction among them is modeled through a 3D blocking scheme. Tests over CLEAR-annotated database yield quantitative results showing the effectiveness of the proposed algorithms in indoor scenarios, and a fair comparison with other state-of-the-art algorithms is presented. We also consider the real-time performance of the proposed algorithm.

1 Introduction

Tracking multiple objects and keeping record of their identities along time in a cluttered dynamic scene is a major research topic in computer vision, basically fostered by the number of applications that benefit from the retrieved information. For instance, multi-person tracking has been found useful for automatic scene analysis [1], human-computer interfaces [2], and detection of unusual behaviors in security applications [3].

A number of methods for camera-based multi-person 3D tracking have been proposed in the literature [4-7]. A common goal in these systems is robustness under occlusions created by the multiple objects cluttering the scene when estimating the position of a target. Single-camera approaches [8] have been widely employed, but they are vulnerable to occlusions, rotation, and scale changes of the target. In order to avoid these drawbacks, multi-camera tracking techniques exploit spatial redundancy among different views and provide 3D information at the actual scale of the objects in the real world. Integration of data extracted from multiple cameras has been proposed in terms of a fusion at feature level as image correspondences [9] or multi-view histograms [10] among others. Information fusion at data or raw level has been achieved by means of voxel reconstructions [11], polygon meshes [12], etc.

Most multi-camera approaches rely on a separate analysis of each camera view, followed by a feature fusion process to finally generate an output. Exploiting the underlying epipolar geometry of a multi-camera setup toward finding the most coherent feature correspondence among views was first tackled by Mikič et al. [13] using algebraic methods together with a Kalman filter, and further developed by Focken et al. [14]. Exploiting epipolar consistency within a robust Bayesian framework was also presented by Canton-Ferrer et al. [9]. Other systems rely on detecting semantically relevant patterns among multiple cameras to feed the tracking algorithm as done in [15] by detecting faces. Particle filtering (PF) [16] has been a commonly employed algorithm because of its ability to deal with problems involving multi-modal distributions and non-linearities. Lanz et al. [10] proposed a multi-camera PF tracker exploiting foreground and color information, and several contributions have also followed this path: [4,7]. Occlusions, being a common problem in feature fusion methods, have been addressed in [17] using HMM to model the temporal evolution of occlusions within a PF algorithm. Information about the tracking scenario can also be exploited toward detecting and managing occlusions as done in [18] by modeling the occluding elements, such as furniture, in a training phase before tracking. It must be noted that, in this article, we assume that all cameras will be covering the area under study. Other approaches to multi-camera/multi-person tracking do not require maximizing the overlap of the field of view of multiple cameras, leading to the non-overlapped multi-camera tracking algorithms [19].

Multi-camera/multi-person tracking algorithms based on a data fusion before doing any analysis was pioneered by Lopez et al. [20] by using a voxela reconstruction of the scene. This idea was further developed by the authors in [5,21] finally leading to the present article. Up to our knowledge, this is the first approach to multi-person tracking exploiting data fusion from multiple cameras as the input of the algorithms. In this article, we first introduce a methodology to multi-person tracking based on a colored voxel representation of the scene as the start of the processing chain. The contribution of this article is twofold. First, we emphasize the importance of the initiation and termination of tracks, usually neglected in most tracking algorithms, that has indeed an impact on the performance of the overall system. A general technique for the initiation/termination of tracks is presented. The second contribution is the filtering step where two techniques are introduced. The first technique applies PF to input voxels to estimate the centroid of the tracked targets. However, this process is far from real-time performance and an alternative, that we call Sparse Sampling (SS). SS aims at decreasing computation time by means of a novel tracking technique based on the seminal PF principle. Particles no longer sample the state space but instead a magnitude whose expectancy produces the centroid of the tracked person: the surface voxels. The likelihood evaluation relying on occupancy and color information is computed on local neighborhoods, thus dramatically decreasing the computation load of the overall algorithm. Finally, effectiveness of the proposed techniques is assessed by means of objective metrics defined in the framework of the CLEAR [22] multi-target tracking database. Computational performance is reviewed toward proving the real-time operation of the SS algorithms. Fair comparisons with state-of-the-art methods evaluated using the same database are also presented and discussed.

2 Tracker design methodology

Typically, a multi-target tracking system can be depicted as in Figure 1 and comprises a number of elementary modules. Although most articles present techniques that contribute to filtering module, the overall architecture is rarely addressed assuming that some blocks are already available. In this section, this scheme will be analyzed and some proposals for each module will be presented. The filtering step, being our major contribution, will be addressed in a separate section.

thumbnailFigure 1. Multi-person tracking scheme.

2.1 Input and output data

When addressing the problem of multi-person tracking within a multi-camera environment, a strategy about how to process this information is needed. Many approaches perform an analysis of the images separately, and then combine the results using some geometric constraints [10]. This approach is denoted as an information combination by fusion of decisions. However, a major issue in this procedure is dealing with occlusion and perspective effects. A more efficient way to combine information is data fusion [23]. In our case, data fusion leads to a combination of information from all images to build up a new data representation, and to apply the algorithms directly on these data. Several data representations aggregating the information of multiple views have been proposed in the literature such as voxel reconstructions [11,24], level sets [25], polygon meshes [12], conexels [26], depth maps [27], etc. In our research, we opted for a colored voxel representation due to both its fast computation and accuracy.

For a given frame in the video sequence, a set of NC images are obtained from the NC cameras (see a sample in Figure 2(a)). Each camera is modeled using a pinhole camera model based on perspective projection with camera calibration information available. Foreground regions from input images are obtained using a segmentation algorithm based on Stauffer-Grimson's background learning and subtraction technique [28] as shown in Figure 2(b).

thumbnailFigure 2. Input data generation example. (a) A sample of the original images. (b) Foreground segmentation of the input images employed by the SfS algorithm. (c) Example of the binary 3D voxel reconstruction. (d) The final colored version shown over a background image.

Redundancy among cameras is exploited by means of a Shape-from-Silhouette (SfS) technique [11]. This process generates a discrete occupancy representation of the 3D space (voxels). A voxel is labeled as foreground or background by checking the spatial consistency of its projection on the NC segmented silhouettes, and finally obtaining the 3D binary reconstruction shown in Figure 2(c). We will denote this raw voxel reconstruction as <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1">View MathML</a>. The visibility of a surface voxel onto a given camera is assessed by computing the discrete ray originating from its optical center to the center of this voxel using Bresenham's algorithm and testing whether this ray intersects with any other foreground voxel. The most saturated color among pixels of the set of cameras that see a surface voxel is assigned to it. A colored representation of surface voxels of the scene is obtained, denoted as <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M2','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M2">View MathML</a>. An example of this process is depicted in Figure 2(d). It should be taken into account that, without loss of generality, other background/foreground and 3D reconstruction algorithms may be used to generate the input data to the tracking algorithm presented in this article.

The resulting colored 3D scene reconstruction is fed to the proposed system that assigns a tracker to each target and the obtained tracks are processed by a higher semantic analysis module. Information about the environment (dimensions of the room, furniture, etc.) allows assessing the validity of tracked volumes and discarding false volume detections.

Finally, the output of the overall tracking algorithm will be a number of hypotheses for the centroid position of each of the targets present in the scene.

2.2 Tracker state and filtering

One of the major challenges in multi-target tracking is the estimation of the number of targets and their positions in the scene, based on a set of uncertain observations. This issue can be addressed from two perspectives. First, extending the theory of single-target algorithms to multiple targets. This approach defines the working state space <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M3">View MathML</a> as the concatenation of the positions of all NT targets as <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M4','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M4">View MathML</a>. The difficulty here is the time variant dimensionality of this space. Monte Carlo approaches, and specifically PF approaches, to this problem have to face the exponential dependency between the number of particles required by the filter and the dimension of <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M3','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M3">View MathML</a>, turning out to be computationally infeasible. Recently, a solution based on random finite sets achieving linear complexity has been presented [29].

Multi-target tracking can also be tackled by tracking each target independently, that is to maintain NT trackers with a state space <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M5','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M5">View MathML</a>. In this case, the system attains a linear complexity with the number of targets, thus allowing feasible implementations. However, interactions among targets must be modeled in order to ensure the most independent set of tracks. This approach to multi-person tracking will be adopted in our research.

2.3 Track initiation and termination

A crucial factor in the performance of a tracking system is the module that addresses the initiation and termination of tracks. The initiation of a new tracker is independent of the employed filtering technique and only relies on the input data and the current state (position) of the tracks in the scene. On the other hand, the termination of a new tracking filter is driven by the performance of the tracker.

The initialization of a new filter is determined by the correct detection of a person in the analyzed scene. This process is crucial when tracking, and its correct operation will drive the overall system's accuracy. However, despite the importance of this step, little attention is paid to it in the design of multi-object trackers in the literature. Only few articles explicitly mention this process such as [30] that employs a face detector to detect a person or [31] that uses scout particle filters to explore the 3D space for new targets. Moreover, it is assumed that all targets in the scene are of interest, i.e., people, not accounting for spurious objects, i.e., furniture, shadows, etc. In this section, we introduce a method to properly handle the initiation and termination of filters from a Bayesian perspective.

2.3.1 Track initiation criteria

The 3D input data <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1">View MathML</a> fed to the tracking system is usually corrupted and presents a number of inaccuracies such as objects not reconstructed, mergings among adjacent blobs, spurious blobs, etc. Hence, defining a track initialization criterium based solely on the presence of a blob might lead to poor performance of the system. For instance, objects such as furniture might be wrongly detected as foreground, reconstructed and tracked. Instead, a classification of the blobs based on a probabilistic criteria can be applied during this initialization process aiming at a more robust operation. Training of this classifier is based on the development set of the used database, together with the available ground truth describing the position of the tracked objects.

Let <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M6','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M6">View MathML</a> be the ground truth positions of the NGT targets present in the scene of the development set at a given instant. Once the reconstruction <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1">View MathML</a> is available, a connected component analysis is performed over these data thus obtaining a set of K disjoint components, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7">View MathML</a>, fulfilling:

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M8','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M8">View MathML</a>

(1)

We will consider the region of influence of a target with centroid x as the ellipsoid <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M9','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M9">View MathML</a> with axis size s = (sx, sy, sz) centered at c.

A mapping is defined such that for every xj XGT a component <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7">View MathML</a> is assigned. Let us denote [x]{x,y,z} as the x, y or z coordinate of vector x. The assignation process is defined as follows: first, a region of influence <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M10','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M10">View MathML</a> with size s = (sx, sy, [xj]z) centered at c = xj is placed in the 3D space. The radii sx and sy are chosen to contain an average person, sx = sy = 30 cm. Let us define the operator |·| applied to a volume as the number of non-zero voxels contained in it. Then, the assignation is defined as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M11','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M11">View MathML</a>

(2)

that is to assign xj to the component with the largest volume enclosed in the region of influence. It must be noted that some xj might not have any <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7">View MathML</a> associated due to a wrong segmentation or faulty reconstruction of the target. Moreover, the set of components not associated to any ground truth position can be identified as spurious objects, reconstructed shadows, etc.

Finally, we have grouped the set of connected components <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M7">View MathML</a> in two categories: person and non-person. A set of features are extracted from each of these components, thus conforming the characteristics that will be used to train a person/no-person binary classifier. This set of extracted features is described in Table 1.

Table 1. Features employed by the person/no-person classifier where magnitude <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M79','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M79">View MathML</a> denotes the x, y, or z coordinates of voxel <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M1">View MathML</a>

In order to characterize the objects to be tracked and to decide the best classifier system, we have performed an exploratory data analysis [32], which will allow us to contrast the underlying hypotheses of the classifiers with the actual data. Histograms of these features are computed as shown in Figure 3 and scatter plots depicting the cross dependencies among all features are computed. Observing Figure 3, we see that some variables are easily separable, i.e., weight, height, and bounding box. Moreover, they show a low cross dependency with other features.

thumbnailFigure 3. Normalized histograms of the variables conforming the feature vector employed by the person/non-person classifier.

A number of standard binary classifiers has been tested and their performances have been evaluated, namely Gaussian, Mixture of Gaussians, Neural Networks, K-Means, PCA, Parzen and Decision Trees [33,34]. Due to the aforementioned properties of the statistic distributions of the features, some classifiers are unable to obtain a good performance, i.e., Gaussian, PCA, etc. Other classifiers require a large number of characterizing elements, such as K-Means, MoG, or Parzen. Decision trees [33] have reported the best results. Separable variables such as height, weight, and bounding box size are automatically selected to build up a decision tree that yields a high recognition rate with a precision of 0.98 and a recall of 0.99 in our test database.

Another complementary criterium employed in the initiation of new tracks is based on the current state of the tracker. It will not be allowed to create a new track if its distance to the closest target is below a threshold.

2.3.2 Track termination criteria

A target will be deleted if one of the following conditions is fulfilled:

- If two or more tracks fall too close to one another, this indicates that they might be tracking the same target, hence only one will be kept alive while the rest will be removed.

- If tracker's efficiency becomes very low it might indicate that the target has disappeared and should be removed.

- The person/no-person classifier is applied to the set of features extracted from the voxels assigned to a target. If the classifier outputs a no-person verdict for a number of frames, the target will be considered as lost.

3 Voxel-based solutions

The filtering block shown in Figure 1 addresses the problem of keeping consistent trajectories of the tracked objects, resolving crossings among targets, mergings with spurious objects (i.e., shadows) and producing an accurate estimation of the centroid of the target based on the input voxel information. Although there is a number of papers addressing the problem of multi-camera/multi-person tracking, very few contributions have been based on voxel analysis [20,21].

3.1 PF tracking

PF is an approximation technique for estimation problems where the variables involved do not hold Gaussianity uncertainty models and linear dynamics. The current tracking scenario can be tackled by means of this algorithm to estimate the 3D position of a person xt = (x, y, z)t at time t, taking as observation a set of colored voxels representing the 3D scene up to time t denoted as z1: t. For a given target xt, PF approximates the posterior density p(xt|z1:t) as a sum of Np Dirac functions:

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M17','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M17">View MathML</a>

(3)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M18">View MathML</a> are the weights associated to the particles, fulfilling <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M19','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M19">View MathML</a>, and <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M20">View MathML</a> their positions. For this type of tracking problem, a sampling importance re-sampling (SIR) PF is applied to drive particles along time [16]. Assuming importance density to be equal to the prior density, weight update is recursively computed as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M21','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M21">View MathML</a>

(4)

SIR PF avoids the particle degeneracy problem by re-sampling at every time step. In this case, weights are set to <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M22','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M22">View MathML</a>j; therefore,

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M23','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M23">View MathML</a>

(5)

Hence, the weights are proportional to the likelihood function that will be computed over the incoming volume zt.

Finally, the best state at time t, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M24','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M24">View MathML</a>, is derived based on the discrete approximation of Equation 3. The most common solution is the Monte Carlo approximation of the expectation as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M25','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M25">View MathML</a>

(6)

Basically, in the PF operation loop two steps must be defined: likelihood evaluation and particles propagation. In the following, we present our proposal for the PF implementation.

3.1.1 Likelihood evaluation

Binary and color information contained in zt will be employed to define the likelihood function <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M26">View MathML</a> relating the observation zt with the human body instance given by particle <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M27','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M27">View MathML</a>. Two partial likelihood functions, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M28','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M28">View MathML</a> and <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M29','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M29">View MathML</a>, will be combined linearly to produce <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M26','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M26">View MathML</a> as:

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M30','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M30">View MathML</a>

(7)

Factor λ controls the influence of each term (foreground and color information) in the overall likelihood function. Empirical tests have shown that λ = 0.8 provides satisfactory results. A more detailed review of the impact of color information in the overall performance of the algorithm is addressed in Section5.1.

Likelihood associated to raw data is defined as the ratio of overlap between the input data <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M31','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M31">View MathML</a> and the ellipsoid <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M32','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M32">View MathML</a> defined by particle <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M20','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M20">View MathML</a> (see Section 2.3.1)

as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M33','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M33">View MathML</a>

(8)

For a given target k, an adaptive reference histogram <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34">View MathML</a> of the colored surface voxels is available. This histogram is constructed using the YCbCr color space due to its robustness against light variations. The number of bins per channel will drive the ability of the system to distinguish between different color blobs; for our experiments, 21 bins per channel have been set empirically. The color likelihood function is constructed as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M35','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M35">View MathML</a>

(9)

where B(·) is the Bhattacharya distance and H(·) stands for the color histogram extraction operation of the enclosed volume. Update of the reference histogram is performed in a linear manner following the rule:

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M36','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M36">View MathML</a>

(10)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M37','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M37">View MathML</a> stands for the ellipsoid placed in the centroid estimation <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M38">View MathML</a> and α is the adaptation coefficient. In our experiments, α = 0.9 provided satisfactory results.

3.1.2 Particle propagation

The propagation model has been chosen to be a Gaussian noise added to the state of the particles after the re-sampling step: <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M39','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M39">View MathML</a>. The covariance matrix P corresponding to N is proportional to the maximum variation of the centroid of the target and this information is obtained from the development part of the testing dataset. More sophisticated schemes employ previously learnt motion priors to drive the particles more efficiently [6]. However, this would penalize the efficiency of the system when tracking unmodeled motions patterns and, since our algorithm is intended for any motion tracking, no dynamical model is adopted.

3.1.3 Interaction model

Let us assume that there are NT independent tracked targets. However, they are not fully independent since each tracker can consider voxels from other targets in both the likelihood evaluation and the 3D re-sampling step, resulting in target merging or identity mismatches. In order to achieve the most independent set of trackers, a blocking method to model interactions is considered. Some blocking proposals can be found in 2D tracking related studies [6] and an extension to the 3D domain is proposed. Blocking methods rely on penalizing particles whose associated ellipsoid model overlaps with other targets' ellipsoid as shown in Figure 4. Hence, blocking information can also be considered when computing the particle weights for the kth target as

thumbnailFigure 4. Particles from the tracker A (yellow ellipsoid) falling into the exclusion zone of tracker B (green ellipsoid) will be penalized by a multiplicative factor α ∈ [0, 1].

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M40','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M40">View MathML</a>

(11)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M41','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M41">View MathML</a> stands for the estimation of the PF at time t - 1 for target k and ϕ(·) is the blocking function defining exclusion zones that penalize particles from target l falling into the exclusion zone of target k. In this particular case, considering that people in the room are always sitting or standing up, this zone can be constrained to the xy plane. The proposed function is

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M42','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M42">View MathML</a>

(12)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M43','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M43">View MathML</a> is the parameter that drives the sensibility of the exclusion zone.

3.2 SS tracking

In the presented PF tracking algorithm, likelihood evaluation can be computationally expensive, thus rendering this approach unsuitable for real-time systems. Moreover, data are usually noisy and may contain merged blobs corresponding to different targets. A new technique, SS, is proposed as an efficient and flexible alternative to PF.

Assuming a homogeneous 3D object, it can be proved that its centroid can exactly be computed based only on the surface voxels, since the interior voxels do not provide any relevant information. Hence, this centroid can be estimated through a discrete version of Green's theorem on the surface voxels [35,36], while other approaches obtain an accurate approximation of the centroid using feature points (see [37] for a review). A common assumption of these techniques is the availability of surface data extracted beforehand, hence a labeling of the voxels in the scene should be available. By assuming that the object under study presents a central symmetry in the xy plane, the computation of the centroid can be done as an average of the positions of the surface voxels:

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M44','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M44">View MathML</a>

(13)

3.2.1 Degree of mass and degree of surfaceness

Let us model the human body as an ellipsoid as previously done in the PF approach. In order to test the robustness of the centroid computation of Equation13 against missing data, we studied the error committed when only a fraction of these input data is employed. A number of voxels (surface or interior voxels in each case) is randomly selected and employed to compute the centroid. Then, the error is computed showing that the surface-based estimation is more sensitive than the estimation using interior voxels (see Figure 5). However, this proves that the centroid can be computed from a number of randomly selected surface voxels still achieving a satisfactory performance. This idea is the underlying principle of the SS algorithm.

thumbnailFigure 5. Centroid's estimation error when computed with a fraction of surface or interior voxels. The employed ellipsoid had a radii s = (30, 30,100) cm, and voxels with sv = 2 cm were used.

Let us estimate the centroid of an object by analyzing a randomly selected number of voxels from the whole scene, denoted as <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a>. An approach to the computation of the centroid would be

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M46','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M46">View MathML</a>

(14)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M47">View MathML</a> gives the mass density of voxel <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a>. Since it is assumed that all voxels have the same mass, this is a binary function that checks the occupancy of a given voxel. Hence, only the fraction of (randomly selected) voxels inside the object will contribute to the computation of the centroid. Equation14 can be rewritten as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M48','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M48">View MathML</a>

(15)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M49','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M49">View MathML</a> can be considered as the normalized mass contribution of voxel <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a> to the computation of the centroid. If function <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M47">View MathML</a> takes values in the range [0,1] we may consider it as the "degree of mass" of <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a> or the importance of voxel <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a> into the calculation of <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M38">View MathML</a>. Then, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M47','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M47">View MathML</a> might be considered as a normalized weight assigned to <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a>. Since we stated that the centroid can be computed using surface voxels, Equation13 can be also posed as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M50','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M50">View MathML</a>

(16)

where <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M51','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M51">View MathML</a> measures the "degree of surfaceness" of voxel <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M45">View MathML</a>. Within this context, functions ρ(·) and ρS(·) might be understood as pseudo-likelihood functions and Equations 16 and 15 as a sample-based representation of an estimation problem.

3.2.2 Difference with particle filters

There is an obvious similarity between these representation and the formulation of particle filters but there is a significant difference. While particles in PF represent an instance of the whole body, our samples <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M52','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M52">View MathML</a> are points in the 3D space. Moreover, particle likelihoods are computed over all data while sample pseudo-likelihoods will be computed in a local domain.

The presented concepts are applied to define the SS algorithm. Let <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M53','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M53">View MathML</a>, a point in the 3D space and <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M54','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M54">View MathML</a> its associated weight measuring the pseudo-likelihood of this position being part of the object or part of its surface. Under certain assumptions, it is achieved that the centroid can be computed as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M55','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M55">View MathML</a>

(17)

where Ns is the number of sampling points. When using SS we are no longer sampling the state space since <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a> cannot be considered an instance of the centroid of the target as happened with particles, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M57">View MathML</a>, in PF. Hence, we will talk about samples instead of particles and we will refer to <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M58">View MathML</a> as the sampling set. This set will approximate the surface of the kth target, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M59','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M59">View MathML</a>, and will fulfill the sparsity condition <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M60','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M60">View MathML</a>.

4 SS implementation

In order to define a method to recursively estimate <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M38','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M38">View MathML</a> from the sampling set <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M58','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M58">View MathML</a>, a filtering strategy has to be set. Essentially, the proposal is to follow the PF analysis loop (re-sampling, propagation, evaluation, and estimation) with some opportune modifications to ensure the convergence of the algorithm.

4.1 Pseudo-likelihood evaluation

Associated weight <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M18','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M18">View MathML</a> to a sample <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a> will measure the likelihood of that 3D position to be part of the surface of the tracked target. When computing the pseudo-likelihood, surface has been chosen instead of interior voxels, based on the efficiency of surface samples to propagate rapidly as will be explained in the next section. As in the defined PF likelihood function, two partial pseudo-likelihood functionsb, <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M61','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M61">View MathML</a> and <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M62">View MathML</a>, are linearly combined to form <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M63">View MathML</a> as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M64','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M64">View MathML</a>

(18)

Partial likelihoods will be computed on a local domain centered in the position <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a>. Let <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M65','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M65">View MathML</a> be a neighborhood of radius r over a connectivity q domain on the 3D orthogonal grid around a sample place in a voxel position <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a>. Then, we define the occupancy and color neighborhoods around <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a> as <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M66','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M66">View MathML</a> and <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M67','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M67">View MathML</a>, respectively.

For a given sample <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a> occupying a single voxel, its weight associated to the raw data will measure its likelihood to belong to the surface of an object. It can be modeled as

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M68','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M68">View MathML</a>

(19)

Ideally, when the sample <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a> is placed in a surface, half of its associated occupancy neighborhood will be occupied and the other half empty. The proposed expression attains its maximum when this condition is fulfilled.

Function <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M62','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M62">View MathML</a> can be defined as the likelihood of a sample belonging to the surface corresponding to the kth target characterized by an adaptive reference color histogram <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34">View MathML</a>:

<a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M69','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M69">View MathML</a>

(20)

Since <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M70">View MathML</a> contains only local color information with reference of the global histogram <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34">View MathML</a>, the distance D(·) is constructed toward giving a measure of the likelihood between this local colored region and <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M34">View MathML</a>. For every voxel in <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M70','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M70">View MathML</a>, it is decided whether it is similar to <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M71">View MathML</a> by selecting the histogram value for the tested color and checking whether it is above a threshold γ or not. Finally, the ratio between the number of similar color and total voxels in the neighborhood gives the color similarity score. Since reference histogram is updated and changes over time, a variable threshold γ is computed, so that the 80% of the values of <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M71','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M71">View MathML</a> are taken into account.

One of the advantages of the SS algorithm is its computational efficiency. The complexity to compute <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M63','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M63">View MathML</a> is quite reduced since it only evaluates a local neighborhood around the sample in comparison with the computational load required to evaluate the likelihood of a particle in the PF algorithm. This point will be quantitatively addressed in Section5.2.

The parameters defining the neighborhood were set to q = 26 and r = 2 yielding to satisfactory results. Larger values of the radius r did not significantly improve the overall algorithm performance but increased its computational complexity.

4.2 Sample propagation and 3D discrete resampling

A sample <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M56">View MathML</a> placed near a surface will have an associated weight <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M72','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M72">View MathML</a> with a high value. It is a valid assumption to consider that some surrounding positions might also be part of this surface. Hence, placing a number of new particles in the vicinity of <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M57','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M57">View MathML</a> would contribute to progressively explore the surface of a voxel set. This idea leads to the spatial re-sampling and propagation scheme that will drive samples along time in the surface of the tracked target.

Given the discrete nature of the 3D voxel space, it will be assumed that every sample is constrained to occupy a single voxel or discrete 3D coordinate and there cannot be two samples placed in the same location. Re-sampling is mimicked from PF so a number of replicas proportional to the normalized weight of the sample are generated. Then, these new samples are propagated and some discrete noise is added to their position meaning that their new positions are also constrained to occupy a discrete 3D coordinate (see an example in Figure 6). However, two re-sampled and propagated particles may fall in the same 3D voxel location as shown in Figure 6. In such case, one of these particles will randomly explore the adjacent voxels until reaching an empty location; if there is not any suitable location for this particle, it will be dismissed.

thumbnailFigure 6. Example of discrete re-sampling and propagation (in 2D). (a) A sample is re-sampled and its replicas are randomly placed occupying a single voxel. (b) Two re-sampled samples fall in the same position (red cell) and one of them (blue) performs a random search through the adjacent voxels to find an empty location.

The choice of sampling the surface voxels of the object instead of its interior voxels to finally obtain its centroid is motivated by the fact that propagating samples along the surface rapidly spread them all around the object as depicted in Figure 7. Propagating samples on the surface is equivalent to propagate them on a 2D domain, hence the condition of not placing two samples in the same voxel will make them to explore the surface faster (see Figure 6). On the other hand, interior voxels propagate on a 3D domain, thus having more space to explore and therefore becoming slower to spread all around the volume (see Figure 6). Although both (pseudo-)likelihoods should produce a fair estimation of the object's centroid, both sampling sets must fulfill the condition to be randomly spread around the object volume, otherwise the centroid estimation will be biased.

thumbnailFigure 7. Sample positions evolution and centroid estimation. Likelihood based on: (a) interior voxels, or (b) surface voxels.

4.2.1 Interaction model

The flexibility of a sample-based analysis may, sometimes, lead to situations where particles spread out too much from the computed centroid. In order to cope with this problem, a intra-target samples' interaction model is devised. If a sample is placed in a position such that <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M73','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M73">View MathML</a> it will be removed (that is to assign <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M74','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M74">View MathML</a>) and we set the threshold as δ = αsx, with sx = 30 cm. Factor α = 1.5 produced accurate results in our experiments.

The interaction among targets is modeled in similar way as in the PF approach. Formulas in Equations 11 and 12 are applied to samples with the appropriate scaling parameter k.

5 Results and evaluation

In order to assess the performance of the proposed tracking systems, they have been tested on the set of benchmarking image sequences provided by the CLEAR Evaluation Campaigns 2007 [22]. Typically, these evaluation sequences involved up to five people moving around in a meeting room. This benchmarking set was formed by two separate datasets, development, and evaluation, containing sequences recorded by five of the participating partners. A sample of these data can be seen in Figure 8. The development set consisted in 5 sequences of an approximate duration of 20 min each, while the evaluation set was formed by 40 sequences of 5min each, thus adding up to 5 h of data. Each sequence was recorded with four cameras placed in the corners of the room and a zenithal camera placed in the ceiling. All cameras were calibrated and had resolutions ranging from 640 × 480 to 756 × 576 pixels at an average frame rate of fR = 25fps. The test environments were a 5 × 4 m rooms with occluding elements such as tables and chairs. Images of the empty rooms were also provided to train the background/foreground segmentation algorithms.

thumbnailFigure 8. CLEAR [22]evaluation dataset sample. Images from several partners showing a common indoor conference room configuration involving several participants.

Metrics proposed in [4] for multi-person tracking evaluation have been adopted, namely the Multiple Object Tracking Precision (MOTP), which shows tracker's ability to estimate precise object positions, and the Multiple Object Tracking Accuracy (MOTA), which expresses its performance at estimating the number of objects, and at keeping consistent trajectories. MOTP scores the average metric error when estimating multiple target 3D centroids, while MOTA evaluates the percentage of frames where targets have been missed, wrongly detected or mismatched.

The aim of a tracking system would be to produce high values of MOTA and low values of MOTP thus indicating its ability to correctly track all targets and estimate their positions accurately. When comparing two algorithms, there will be a preference to choose the one outputting the highest MOTA score.

5.1 Results

To demonstrate the effectiveness of the proposed multi-person tracking approaches, a set of experiments were conducted over the CLEAR 2007 database. The development part of the dataset was used to train the initiation/termination of tracks modules as described in Section 2.3 and the remaining test part was used for our experiments.

First, the multi-camera data are pre-processed performing the foreground and background segmentations and 3D voxel reconstruction algorithm. In order to analyze the dependency of the tracker's performance with the resolution of the 3D reconstruction, several voxel sizes were employed <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M75','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M75">View MathML</a> cm. A colored version of these voxel reconstructions was also generated, according to the technique introduced in Section 2.1. Then, these data were the input fed to the PF and SS proposed approaches.

In both types of filters, SS or PF, three parameters drive the performance of the algorithm: the voxel size <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M76">View MathML</a>, the number of samples Ns, or particles Np, and the usage of color information. Experiments carried out explore the influence of these parameters in the MOTP, precision in cm., and MOTA, tracker accuracy (in % of correctly tracked targets), shown in Figure 9. Some remarks can be drawn

thumbnailFigure 9. MOTP and MOTA scores for the SS and the PF techniques using raw and colored voxels. Several voxel sizes sv = {2, 5, 10, 15} cm have been used in the experiments.

- Number of samples/particles: There is a dependency between the MOTP score and the number of particles/samples, especially for the SS algorithm. The contribution of a new sample to the estimation of the centroid in the SS has less impact than the addition of a new particle in the PF, hence the slower decay of the MOTP curves for the SS than for the PF. Regarding the MOTA score, there is not a significant dependency with Ns or Np. Two factors drive the MOTA of an algorithm: the track initiation/termination modules, that mainly contributes to the ratio of misses and false positives, and the filtering step that has an impact to the mismatches ratio. The low dependency of MOTA with Ns or Np shows that most of the impact of the algorithm in this score is due to the particle/sample propagation and interaction strategies rather than the quantity of particles/samples itself. Moreover, the influence in the MOTA score is tightly correlated with the track initiation/termination policy. This assumption was experimentally validated by testing several classification methods (mixture of Gaussians, PCA, Parzen, and K-Means) in the initiation/termination modules yielding to a drop in the MOTA score proportional to their ability to correctly classify a blob as person/no-person.

- Voxel size: Scenes reconstructed with a large voxel size do not capture well all spatial details and may miss some objects thus decreasing the performance of the system (both in SS and PF). It can be observed that MOTP and MOTA scores improve as the voxel size decrease.

- Color features: Color information improves the performance of SS and PF in both MOTP and MOTA scores. First, there is an improvement when using color information for a given voxel size, specially for the SS algorithm. Moreover, the smaller the voxel size the most noticeable difference between the experiments using raw and color features. This effect is supported by the fact that color characteristics are better captured when using small voxel sizes. The performance improvement when using color in the SS algorithm is more noticeable since samples are placed in the regions with a high likelihood to be part of the target. For instance, this effect is more evident in cases where the subject is sitting and the particles concentrate in the upper body part, disregarding the part of the chair. In the SS algorithm, MOTP score benefits from this efficient sample placement. PF algorithm is constrained to evaluate the color likelihood in the ellipsoid defined in Equation 9 thus not being able to differentiate between parts of the blob that do not belong to the tracked target. Color information used within the filtering loop leads to a better distinguishability among blobs, thus reducing the mismatch ratio and slightly improving the MOTA score. Merging of adjacent blobs or complex crossing among targets is also correctly resolved. An example of the impact of color information is shown in Figure 10 where the usage of color avoids the mismatch between two targets. This effect is more noticeable when targets in the scene are dressed in different colors.

thumbnailFigure 10. Zenithal view of two comparative experiments showing the influence of color in the SS algorithm. The cross-over between two targets is correctly tackled when using color information whereas using only raw features leads to a mismatch and, afterwards, a track loss (white ellipsoid) and the initiation of a new one (cyan ellipsoid).

We can compare the results obtained by SS and PF with other algorithms evaluated using the same CLEAR 2007 database whose scores are reported in Table 2. Most of these methods exploited multi-view information with the exception of [31] that only used the zenithal camera facing the associated distortion and perspective problems. PF is the most employed technique due to its suitability to the characteristics of this problem although Kalman filtering used by [15] provided fair results when fed by higher semantical features extracted from the input data (in this case, faces). Note the low FP score for this system as a consequence of the unlikely event of detecting a face in a spurious object. A 3D voxel reconstruction was used as the input data in [5] together with a simple track management system. The rest of the methods [7,31] relied on a fixed human body appearance model similar to the ellipsoidal region of interest used in our PF proposal. However, the novelty of these methods is the strategies to combine the information coming from the analysis of different views without performing any 3D reconstruction. Comparing the best proposed tracking system [31]c with our two approaches, we obtain a relative improvement of Δ(MOTP, MOTA)ss = (7.63,17.13)% and Δ(MOTP, MOTA)PF = (5.16,7.15)%.

Table 2. Results presented at the CLEAR 2007 [22] by several partners

In order to visually show the performance of the SS algorithm, some videos corresponding to the most challenging tracking scenarios have been made available at http://www.cristiancanton.org webcite.

5.2 Computational performance

Comparing obtained metrics among different algorithms can give an idea about their performance in a scenario where computational complexity is not taken into account. An analysis of the operation time of several algorithms under the same conditions and the produced MOTP/MOTA metrics might give a more informative and fairer comparison tool. Although there is not a standard procedure to measure the computational performance of a tracking process, we devised a method to assess the computational efficiency of our algorithms to present a comparative study.

The RTFfactor associated with a performance measure MOTP/MOTA (in both vertical axes) of the SS and PF algorithms when dealing with raw and colored input voxels is presented in Figure 11. This factor indicates a proportional measure of the speed of the algorithm where RTF = 1 stands for real-time operation while RTF > 1 and RTF < 1 indicate a faster or slower performance, respectively. Each point of every curve is the result of an experiment conducted over all the CLEAR data set associated to a number of samples/particles of each algorithm.

thumbnailFigure 11. Computational performance comparison between PF and SS using several voxel sizes <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M78','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M78">View MathML</a> and features (raw or colored voxels). MOTP and MOTA scores are related to the real-time factor (RTF) showing the computational load required by each algorithm to attain a given tracking performance.

The first noticeable characteristic of these charts is that, due to the computational complexity of each algorithm, when comparing SS and PF algorithms under the same operation conditions, the RTF associated with SS is always higher than the associated with PF. Similarly, the computational load is higher when analyzing colored than raw inputs. All the plotted curves attain lower RTF performance values as the size of the voxel <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M76','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M76">View MathML</a> decreases since the amount of data to process increases (note the different RTF scale ranges for each voxel size in Figure 11). Regarding the MOTP/MOTA metrics, there is a common tendency to a decrease in the MOTP and an increase in the MOTA as the RTF decreases. The separation between the SS and PF curves is bigger as the voxel size decreases since the PF algorithm has to evaluate a larger amount of data.

The observation of these results yields the conclusion that the SS algorithm is able to produce a similar and, in some cases, better results than the PF algorithm with a lower computational cost. For example, using <a onClick="popup('http://asp.eurasipjournals.com/content/2011/1/114/mathml/M77','MathML',630,470);return false;" target="_blank" href="http://asp.eurasipjournals.com/content/2011/1/114/mathml/M77">View MathML</a> cm, a MOTP score of around 165 mm can be obtained using SS with a RTF ten times larger than when using PF and similarly with the MOTA score.

6 Conclusions

In this article, we have presented a number of contributions to the multi-person tracking task in a multi-camera environment. A block representation of the whole tracking process allowed to identify the performance bottlenecks of the system and address efficient solutions to each of them. Real-time performance of the system was a major goal hence efficient tracking algorithms have been produced as well as an analysis of their performance.

The performance of these systems has thoroughly been tested over the CLEAR database and quantitatively compared through two scores: MOTP and MOTA. A number of experiments have been conducted toward exploring the influence of the resolution of the 3D reconstruction and the color information. Results have been compared with other state-of-the-art algorithms evaluated with the same metrics using the same testing data.

The relevance of the initiation and termination of filters have been proved, since these modules have a major impact on the MOTA score. However, most articles in the literature do not specifically address the operation of these modules. We proposed a statistical classifier based on classification trees as a way to discriminate blobs between the person/no-person classes. Training of this classifier was done using data available in the development part of the employed database and a number of features (namely weight, height, top in z-axis, bounding box size) were extracted and provided as the input to the classifier. Another criterium such as a proximity to other already existing tracks was employed to create or destroy a track. Performance scores in Table 2 for the PF and SS systems present the lowest values for the false positives (FP) and missed targets (Miss) ratios hence supporting the relevance of the initiation and termination of tracks modules.

Two proposals for the filtering step of the tracking system have been presented: PF and SS. An independent tracker was assigned to every target and an interaction model was defined. PF technique proved to be robust and leaded to state-of-the-art results but its computational load was unaffordable for small voxel sizes. As an alternative, SS algorithm has been presented achieving a similar and, in some occasions, better performance than PF at a smaller computational cost. Its sample-based estimation of the centroid allowed a better adaptation to noisy data and distinguishability among merged blobs. In both PF and SS, color information provided a useful cue to increase the robustness of the system against track mismatches thus increasing the MOTA score. In the SS, color information also allowed a better placement of the samples allowing to distinguish among parts belonging to the tracked object and parts of a merging with a spurious object, leading to a better MOTP score.

Future research within this topic involves multi-modal data fusion with audio data toward improving the precision of the tracker, MOTP, and avoid mismatches among targets, thus improving the MOTA score.

End notes

aAnalogously to the pixel definition (picture element) as the minimum information unit in a discrete image, the voxel (volume element) is defined as the minimum information unit in a 3D discrete representation of a volume.

bFor the sake of simplicity in the notation, pseudo-likelihood functions will be denoted as p(·) instead of defining a specific notation for it.

cWhen selecting the best system, the MOTA score is regarded as the most significant value.

The authors declare that they have no competing interests

References

  1. S Park, MM Trivedi, Understanding human interactions with track and body synergies captured from multiple views. Comput Vis Image Understand 111(1), 2–20 (2008). Publisher Full Text OpenURL

  2. Project CHIL--Computers in the Human Interaction Loop. [http://chil.server.de] webcite

  3. I Haritaoglu, D Harwood, LS Davis, W4: real-time surveillance of people and their activities. IEEE Trans Pattern Anal Mach Intell 22(8), 809–830 (2000). Publisher Full Text OpenURL

  4. K Bernardin, A Elbs, R Stiefelhagen, Multiple object tracking performance metrics and evaluation in a smart Room environment. Proceedings of IEEE International Workshop on Visual Surveillance (2006)

  5. C Canton-Ferrer, J Salvador, JR Casas, Multi-person tracking strategies based on voxel analysis. Proceedings of Classification of Events, Activities and Relationships Evaluation and Workshop (Lecture Notes on Computer Science, 2007) 4625, pp. 91–103

  6. Z Khan, T Balch, F Dellaert, Efficient particle filter-based tracking of multiple interacting targets using an MRF-based motion model. Proceedings of International Conference on Intelligent Robots and Systems 1(1), 254–259 (2003)

  7. O Lanz, P Chippendale, R Brunelli, An appearance-based particle filter for visual tracking in smart rooms. Proceedings of Classification of Events, Activities and Relationships Evaluation and Workshop (Lecture Notes on Computer Science, 2007) 4625, pp. 57–69

  8. A Yilmaz, O Javed, M Shah, Object tracking: a survey. ACM Comput Surv 38(4), 1–45 (2006)

  9. C Canton-Ferrer, JR Casas, M Pardàs, Towards a Bayesian approach to robust finding correspondences in multiple view geometry environments. Proceedings of 4th International Workshop on Computer Graphics and Geometric Modelling (Lecture Notes on Computer Science, 2005) 3515, pp. 281–289

  10. O Lanz, Approximate Bayesian multibody tracking. IEEE Trans Pattern Anal Mach Intell 28(9), 1436–1449 (2006). PubMed Abstract | Publisher Full Text OpenURL

  11. GKM Cheung, T Kanade, JY Bouguet, M Holler, A real time system for robust 3D voxel reconstruction of human motions. IEEE Conference on Computer Vision and Pattern Recognition 2, 714–720 (2000)

  12. J Isidoro, S Sclaroff, Stochastic refinement of the visual hull to satisfy photometric and silhouette consistency constraints. Proceedings of IEEE International Conference on Computer Vision 2, 1335–1342 (2003)

  13. I Mikič, S Santini, R Jain, Tracking objects in 3D using multiple camera views, in. Proceedings of Asian Conference on Computer Vision (2000)

  14. D Focken, R Stiefelhagen, Towards vision-based 3-D people tracking in a Smart Room. Proceedings of IEEE International Conference on Multimodal Interfaces, 400–405 (2002)

  15. N Katsarakis, F Talantzis, A Pnevmatikakis, L Polymenakos, The AIT 3D audiovisual person tracker for CLEAR 2007. Proceedings of Classification of Events, Activities and Relationships Evaluation and Workshop (Lecture Notes on Computer Science, 2007) 4625, pp. 35–46

  16. MS Arulampalam, S Maskell, N Gordon, T Clapp, A tutorial on particle filters for online nonlinear/non-Gaussian Bayesian tracking. IEEE Trans Signal Process 50(2), 174–188 (2002). Publisher Full Text OpenURL

  17. K Lien, C Huang, Multiview-based cooperative tracking of multiple human objects. EURASIP J. Image Video Process 8(2), 1–13 (2008)

  18. T Osawa, X Wu, K Sudo, K Wakabayashi, H Arai, MCMC based multi-body tracking using full 3D model of both target and environment. Proceedings of IEEE Conference on Advanced Video and Signal Based Surveillance, 224–229 (2007)

  19. J Black, T Ellis, P Rosin, Multi view image surveillance and tracking. Proceedings of Workshop on Motion and Video Computing, 169–174 (2002)

  20. A López, C Canton-Ferrer, JR Casas, Multi-person 3D tracking with particle filters on voxels. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 1, 913–916 (2007)

  21. C Canton-Ferrer, R Sblendido, JR Casas, M Pardàs, Particle Filtering and sparse sampling for multi-person 3D tracking. Proceedings of IEEE International Conference on Image Processing, 2644–2647 (2008)

  22. CLEAR--Classification of Events, Activities and Relationships Evaluation and Workshop. [http://www.clear-evaluation.org] webcite

  23. DL Hall, SAH McMullen, Mathematical Techniques in Multisense Data Fusion. Artech House (2004)

  24. KN Kutulakos, SM Seitz, A theory of shape by space carving. Int J Comput Vis 38(3), 199–218 (2000). Publisher Full Text OpenURL

  25. O Faugeras, R Keriven, Variational principles, surface evolution, PDE's, level set methods and the stereo problem. Proceedings of 5nd IEEE EMBS International Summer School on Biomedical Imaging (2002)

  26. JR Casas, J Salvador, Image-based multi-view scene analysis using conexels. Proceedings of HCSNet Workshop on Use of Vision in Human-Computer Interaction, 19–28 (2006)

  27. V Kolmogorov, R Zabin, What energy functions can be minimized via graph cuts? IEEE Trans Pattern Anal Mach Intell 26(2), 147–159 (2004). PubMed Abstract | Publisher Full Text OpenURL

  28. C Stauffer, W Grimson, Adaptive background mixture models for real-time tracking, in. Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 252–259 (1999)

  29. E Maggio, E Piccardo, C Regazzoni, A Cavallaro, Particle PHD filtering for multi-target visual tracking. Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing 1, 1101–1104 (2007)

  30. F Talantzis, A Pnevmatikakis, AG Constantinides, Audio-visual active speaker tracking in cluttered indoors environments. IEEE Trans Syst Man Cybern B 38(3), 799–807 (2008)

  31. K Bernardin, T Gehrig, R Stiefelhagen, Multi-level particle filter fusion of features and cues for audio-visual person tracking. Proceedings of Classification of Events, Activities and Relationships Evaluation and Workshop (Lecture Notes on Computer Science, 2007) 4625, pp. 70–81

  32. JW Tuckey, Exploratory Data Analysis. Addison-Wesley (1977)

  33. L Breiman, JH Friedman, RA Olshen, CJ Stone, Classification and Regression Trees. Chapman and Hall (1993)

  34. RO Duda, PE Hart, DG Stork, Pattern Classification. Wiley-Interscience (2000)

  35. JJ Crisco, RD McGovern, Efficient calculation of mass moments of intertia for segmented homogeneous three-dimensional objects. J Biomech 31(1), 97–101 (1998). PubMed Abstract | Publisher Full Text OpenURL

  36. JG Leu, Computing a shape's moments from its boundary. Pattern Recogn 24(10), 116–122 (1991)

  37. L Yang, F Albregtsen, Fast and exact computation of Cartesian geometric moments using discrete Green's theorem. Pattern Recogn 29(7), 1061–1073 (1996). Publisher Full Text OpenURL