Deep Learning for Multi-Object Tracking

Publication:  Yuan-Kai Wang, T. M. Pan, C. E. Hu, "Single-Task Joint Learning Model for an Online Multi-Object Tracking Framework," Appl. Sci. 2024, 14(22), 10540.

ABSTRACT 

Multi-object tracking faces critical challenges, including occlusions, ID switches, and erroneous detection boxes, which significantly hinder tracking accuracy in complex environments. To address these issues, this study proposes a single-task joint learning (STJL) model integrated into an online multi-object tracking framework to enhance feature extraction and model robustness across diverse scenarios. Employing cross-dataset training, the model has improved generalization capabilities and can effectively handle various tracking conditions. A key innovation is the refined tracker initialization strategy that combines detection and tracklet confidence, which significantly reduces the number of false positives and ID switches. Additionally, the framework employs a combination of Mahalanobis and cosine distances to optimize data association, further improving tracking accuracy. The experimental results demonstrate that the proposed model outperformed state-of-the-art methods on standard benchmark datasets, achieving superior MOTA and reduced ID switches, confirming its effectiveness in dynamic and occlusion-heavy environments.

INDEX TERMS  multi-object tracking; single-task joint learning; cross-dataset training; feature extraction; tracker initialization; cosine distance; data association; occlusion handling
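The data association described in the abstract pairs a motion cue (the Mahalanobis distance between a detection and a track's predicted state) with an appearance cue (the cosine distance between re-identification embeddings). As a minimal sketch of how such a combined cost can be assembled and solved, the following Python fragment is illustrative only: the blending weight lam, the gating thresholds, and all function and field names are assumptions, not the paper's exact formulation.

import numpy as np
from scipy.optimize import linear_sum_assignment

def mahalanobis_sq(track_mean, track_cov, detection):
    """Squared Mahalanobis distance between a track's predicted measurement
    (mean, covariance) and a detection in the same measurement space."""
    diff = detection - track_mean
    return float(diff @ np.linalg.inv(track_cov) @ diff)

def cosine_distance(track_feat, det_feat):
    """One minus the cosine similarity of L2-normalized appearance embeddings."""
    a = track_feat / np.linalg.norm(track_feat)
    b = det_feat / np.linalg.norm(det_feat)
    return 1.0 - float(a @ b)

def associate(tracks, detections, lam=0.5, maha_gate=9.4877, cos_gate=0.4):
    """Blend both distances into one cost matrix and solve it with the
    Hungarian method; maha_gate is the chi-square 95% quantile for four
    degrees of freedom, a common gating choice in Kalman-based trackers."""
    cost = np.full((len(tracks), len(detections)), 1e5)
    for i, trk in enumerate(tracks):
        for j, det in enumerate(detections):
            d_m = mahalanobis_sq(trk["mean"], trk["cov"], det["box"])
            d_c = cosine_distance(trk["feat"], det["feat"])
            if d_m <= maha_gate and d_c <= cos_gate:  # gate implausible pairs
                cost[i, j] = lam * d_m / maha_gate + (1.0 - lam) * d_c
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < 1e5]

Detections that fail either gate keep a prohibitive cost, so they can only remain unmatched; this mirrors the common practice of letting motion gating veto appearance matches during heavy occlusion.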

STATE-OF-THE-ART COMPARISON 

In the experimental evaluation across three videos, namely PETS09-S2L1, PETS09-S2L2, and TUD-Crossing, our proposed method consistently outperformed the other state-of-the-art techniques, excelling in metrics such as MT, ML, FN, and MOTA. The tracking results for the three videos are shown in Table 1. For PETS09-S2L1, our method achieved superior results, surpassing Deep-SORT by 2.5%, OMVT_TAAC by 12.8%, DMOT_MDP by 1.6%, and HMOT by 0.1% in terms of MOTA. For PETS09-S2L2, our method demonstrated excellent performance in MT, FN, and MOTA, outperforming EAMTT by 1.8%, AP_HWDPL by 6.1%, CDT by 14.4%, KCF by 10.3%, MDP_SubCNN by 14.6%, and CDA-DDAL by 11.3%. In the case of TUD-Crossing, our method led in ML, FN, and MOTA; in terms of MOTA, it outperformed CDT by 10.6%, KCF by 4.6%, MDP_SubCNN by 1.7%, TSML_CDE by 4.9%, AP_HWDPL by 2.7%, and CDA-DDAL by 1.7%.

Overall, our method consistently achieved the best results in FN and MOTA across the three experimental videos, demonstrating its adaptability to various video scenarios and its ability to overcome diverse challenges for accurate tracking outcomes. The practical tracking outcomes of the proposed method are presented below, accompanied by an analysis. The results for PETS09-S2L1 are depicted in Figure 1a. Our method accurately tracked all objects, even when the pedestrians with IDs 1 and 2 briefly stopped and temporarily occluded each other during their encounter; the tracker maintained continuous tracking of both pedestrians. Similarly, the pedestrian with ID 3 navigated around the pedestrians with IDs 1 and 2 after passing behind a streetlight that momentarily obstructed the view. This pedestrian also experienced occlusion during the process, but tracking continued seamlessly once the occlusion cleared.

The results for PETS09-S2L2 are shown in Figure 1b. In this video, a large number of pedestrians walked randomly and the pedestrian density was higher, leading to more pronounced occlusions among pedestrians. The results show that most of the targets were accurately tracked. Pedestrians located farther from the crowded areas, such as those with IDs 40 and 58, were successfully tracked. Within the crowded areas, apart from individual pedestrians who could not be detected owing to extensive occlusion, the majority, including the pedestrians with IDs 41 and 11, were also tracked effectively.

The results for the TUD-Crossing dataset are illustrated in Figure 1c. Although this video featured fewer pedestrians, the frontal camera angle resulted in relatively larger areas of occlusion among pedestrians. The video depicts multiple pedestrians converging in the same direction, such as those with IDs 2, 3, 4, 5, and 6, who were sometimes temporarily obscured by pedestrians moving in the opposite direction, such as the pedestrian with ID 1. The images show that these pedestrians' IDs were maintained successfully, even after brief occlusions. Additionally, the pedestrian with ID 1 continued to be tracked accurately despite the presence of multiple pedestrians in the background.

Table 1. Comparison of multi-object tracking results between state-of-the-art methods and our method.

Figure 1. Tracking results in (a) PETS09-S2L1, (b) PETS09-S2L2, (c) TUD-Crossing.

CONCLUSION

This study proposes an online multi-object tracking method that uses advanced deep learning techniques to train feature models with pedestrian re-identification datasets. To overcome the limitations of single datasets, we applied a single-task multi-dataset joint learning approach to enhance data diversity and model performance. Additionally, we propose an initialization method that combines detector and tracklet confidence, achieving the highest MOTA performance in the comparative analysis. Overall, the proposed method outperformed state-of-the-art techniques, and future work will focus on further improving feature models and tracking accuracy through more complex loss functions and network architectures.
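The initialization method named above gates new tracks on both the detector score and an accumulated tracklet confidence. A minimal sketch of that idea follows, assuming the tracklet confidence takes the form of an averaged detection score decayed by recent misses; the class, thresholds, and decay factor are hypothetical illustrations rather than the paper's definition.

from dataclasses import dataclass, field

@dataclass
class CandidateTracklet:
    det_scores: list = field(default_factory=list)  # per-frame detector confidences
    hits: int = 0    # consecutive frames with a matched detection
    misses: int = 0  # consecutive frames without a match

    def update(self, det_score=None):
        """Record one frame: a matched detection score, or a miss."""
        if det_score is None:
            self.misses += 1
            self.hits = 0
        else:
            self.det_scores.append(det_score)
            self.hits += 1
            self.misses = 0

    def tracklet_confidence(self):
        """Assumed form: mean detection score, decayed by recent misses."""
        if not self.det_scores:
            return 0.0
        mean_score = sum(self.det_scores) / len(self.det_scores)
        return mean_score * (0.9 ** self.misses)

    def should_confirm(self, det_thresh=0.6, trk_thresh=0.5, min_hits=3):
        """Promote a candidate to a full track only when the latest detector
        score and the tracklet confidence both clear their thresholds,
        which suppresses false-positive initializations."""
        return (self.hits >= min_hits
                and self.det_scores[-1] >= det_thresh
                and self.tracklet_confidence() >= trk_thresh)

Requiring several consecutive hits before confirmation trades a short initialization delay for the reduction in false positives and ID switches reported above.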

While this study specifically addresses pedestrian multi-object tracking (MOT) using Single-Task Joint Learning (STJL) in moderate-scale scenarios, future work will focus on scaling the model for larger applications, including environments with higher object counts and multi-camera systems where data volume and tracking complexity are significantly increased. To support real-world deployment, we will leverage scalability and resource efficiency techniques from prior research on surveillance optimization, such as distributed computing, model parallelization, and dynamic resource allocation, to ensure that the model maintains real-time responsiveness even in resource-constrained settings. Additional strategies, including data compression and model pruning, will enable effective operation of the model on edge devices with real-time performance, facilitating smooth integration into existing video management system (VMS) standards across complex surveillance networks. Moreover, extending the model’s adaptability in handling diverse object types and responding to unpredictable conditions such as variable weather and lighting will provide a more comprehensive evaluation of its robustness in real-world scenarios. These enhancements can establish our model as a scalable, resilient solution capable of meeting the practical demands of large-scale, dynamic surveillance environments.

REFERENCES

[10] Bae, S.H.; Yoon, K.J. Confidence-Based Data Association and Discriminative Deep Appearance Learning for Robust Online Multi-Object Tracking. IEEE Trans. Pattern Anal. Mach. Intell. 2018, 40, 595–610.

[12] Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017.

[15] Chen, L.; Ai, H.; Shang, C.; Zhuang, Z.; Bai, B. Online Multi-Object Tracking with Convolutional Neural Networks. In Proceedings of the IEEE International Conference on Image Processing, Beijing, China, 17–20 September 2017.

[36] Sanchez-Matilla, R.; Poiesi, F.; Cavallaro, A. Online Multi-target Tracking with Strong and Weak Detections. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 8–16 October 2016.

[37] Kim, H.-U.; Kim, C.-S. CDT: Cooperative Detection and Tracking for Tracing Multiple Objects in Video Sequences. In Proceedings of the European Conference on Computer Vision (ECCV), Amsterdam, The Netherlands, 8–16 October 2016.

[38] Xiang, Y.; Alahi, A.; Savarese, S. Learning to Track: Online Multi-object Tracking by Decision Making. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 7–13 December 2015.

[39] Wang, B.; Wang, G.; Chan, K.L.; Wang, L. Tracklet Association by Online Target-Specific Metric Learning and Coherent Dynamics Estimation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 589–602.

[40] Chu, P.; Fan, H.; Tan, C.C.; Ling, H. Online Multi-Object Tracking with Instance-Aware Tracker and Dynamic Model Refreshment. In Proceedings of the IEEE Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 7–11 January 2019.

[48] Le, Q.C.; Conte, D.; Hidane, M. Online Multiple View Tracking: Targets Association Across Cameras. In Proceedings of the 6th Workshop on Activity Monitoring by Multiple Distributed Sensing, Newcastle, UK, 6 September 2018.

[49] Zuo, G.; Du, T.; Ma, L. Dynamic target tracking based on corner enhancement with Markov decision process. J. Eng. 2018, 2018, 1617–1622.

[50] Liu, J.; Cao, X.; Li, Y.; Zhang, B. Online Multi-Object Tracking Using Hierarchical Constraints for Complex Scenarios. IEEE Trans. Intell. Transp. Syst. 2018, 19, 151–161.