Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking

Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) by embedding the Re-Identification (ReID) task into the detector, so that appearance feature extraction runs as an auxiliary task, achieving a balance between inference speed and tracking performance. However, resolving the competition between the detector and the feature extractor has always been a challenge. Meanwhile, the issue of directly embedding the ReID task into MOT has remained unresolved: the appearance features lack discriminability, which limits their utility. In this paper, we propose a new learning approach that uses cross-correlation to capture temporal information about objects. The feature extraction network is no longer trained solely on appearance features from each frame but learns richer motion features from the feature heatmaps of consecutive frames, which addresses the challenge of inter-class feature similarity. Furthermore, we apply our learning approach to a more lightweight feature extraction network and treat the feature matching scores as strong cues rather than auxiliary cues, with an appropriate weight calculation to reflect the compatibility between our obtained features and the MOT task. Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks, i.e., MOT17, MOT20, and DanceTrack. Specifically, on the DanceTrack test set, we achieve 56.8 HOTA, 58.1 IDF1 and 92.5 MOTA, making it the best online tracker capable of real-time performance. Comparative evaluations with other trackers show that our tracker achieves the best balance between speed, robustness and accuracy. Code is available at https://github.com/yfzhang1214/TCBTrack.
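To make the cross-correlation idea concrete, the following is a minimal NumPy sketch, not the TCBTrack implementation. It assumes an object is represented by a single C-dimensional embedding taken from the previous frame's feature map, and it correlates that embedding with every location of the current frame's feature map to produce a response heatmap. The function name cross_correlation_heatmap, the tensor shapes, and the synthetic data are all hypothetical and chosen only for illustration.

```python
import numpy as np

def cross_correlation_heatmap(template, feature_map):
    """Correlate a (C,) template embedding with a (C, H, W) feature map.

    Returns an (H, W) heatmap of cosine similarities. This is an
    illustrative assumption, not the network described in the paper.
    """
    c, h, w = feature_map.shape
    # Normalise the template and every spatial location to unit length.
    t = template / (np.linalg.norm(template) + 1e-12)
    f = feature_map.reshape(c, -1)
    f = f / (np.linalg.norm(f, axis=0, keepdims=True) + 1e-12)
    # Dot product of the template with each location gives the response.
    return (t @ f).reshape(h, w)

# Synthetic feature maps for two consecutive frames (hypothetical data).
rng = np.random.default_rng(0)
prev_feat = rng.normal(size=(16, 8, 8))
curr_feat = rng.normal(size=(16, 8, 8))

# Plant the same object embedding at (2, 3) in the previous frame
# and at (4, 5) in the current frame to simulate motion.
obj = rng.normal(size=16) * 3.0
prev_feat[:, 2, 3] = obj
curr_feat[:, 4, 5] = obj

heat = cross_correlation_heatmap(prev_feat[:, 2, 3], curr_feat)
peak = np.unravel_index(np.argmax(heat), heat.shape)
```

In this toy setting the heatmap peaks at the object's new location, which is the kind of temporal cue a feature extractor could be supervised with across consecutive frames. How the actual method builds and supervises its heatmaps is not specified in the abstract.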
