Tracking multiple people across multiple cameras is an open problem. It is typically divided into two tasks: (i) single-camera tracking (SCT) - identify trajectories in the same scene, and (ii) inter-camera tracking (ICT) - identify trajectories across cameras for real surveillance scenes. Many methods cater to SCT, while ICT still remains a challenge. In this paper, we propose a tracking method which uses motion cues and a feature aggregation network for template-based person re-identification by incorporating metadata such as person bounding box and camera information. We present a feature aggregation architecture called Composite Appearance Network (CAN) to address the above problem. The key structure of this architecture is called EvalNet that pays attention to each feature vector and learns to weight them based on gradients it receives for the overall template for optimal re-identification performance. We demonstrate the efficiency of our approach with experiments on the challenging multi-camera tracking dataset, DukeMTMC. We also survey existing tracking measures and present an online error metric called "Inference Error" (IE) that provides a better estimate of tracking/re-identification error, by treating SCT and ICT errors uniformly.
updated: Thu Oct 03 2019 22:48:18 GMT+0000 (UTC)
published: Thu Nov 15 2018 20:23:46 GMT+0000 (UTC)