Multi-Object Tracking (MOT) remains a vital component of intelligent video analysis; it aims to locate targets and maintain a consistent identity for each target throughout a video sequence. Existing works usually learn discriminative feature representations, such as motion and appearance, to associate detections across frames, but these representations are easily degraded by mutual occlusion and background clutter in practice. In this paper, we propose a simple yet effective two-stage feature learning paradigm that jointly learns single-shot and multi-shot features for different targets, so as to achieve robust data association during tracking. For detections that have not yet been associated, we design a novel single-shot feature learning module that extracts discriminative features of each detection and can efficiently associate targets between adjacent frames. For tracklets that have been lost for several frames, we design a novel multi-shot feature learning module that extracts discriminative features of each tracklet and can accurately re-identify these lost targets after a long period. Once equipped with a simple data association logic, the resulting VisualTracker can perform robust MOT based on the single-shot and multi-shot feature representations. Extensive experimental results demonstrate that our method achieves significant improvements on the MOT17 and MOT20 datasets while reaching state-of-the-art performance on the DanceTrack dataset.
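The two-stage association described above can be illustrated with a minimal sketch. This is not the VisualTracker implementation: the learned feature modules are replaced by precomputed vectors, the multi-shot feature is assumed to be a simple mean over a tracklet's feature history, matching is greedy on cosine distance, and the function names and thresholds (`associate`, `single_thr`, `multi_thr`) are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length vectors (1.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    if na == 0 or nb == 0:
        return 1.0
    return 1.0 - dot / (na * nb)


def greedy_match(track_feats, det_feats, max_dist):
    """Greedily pair tracks and detections in order of increasing distance.

    track_feats: dict track_id -> feature vector
    det_feats:   dict detection index -> feature vector
    Returns a dict detection index -> track_id.
    """
    pairs = sorted(
        (cosine_distance(tf, df), tid, did)
        for tid, tf in track_feats.items()
        for did, df in det_feats.items()
    )
    matches, used_tracks = {}, set()
    for dist, tid, did in pairs:
        if dist > max_dist:
            break  # all remaining pairs are even farther apart
        if tid in used_tracks or did in matches:
            continue
        matches[did] = tid
        used_tracks.add(tid)
    return matches


def mean_feature(history):
    """Assumed multi-shot aggregation: element-wise mean over a feature history."""
    n = len(history)
    return [sum(col) / n for col in zip(*history)]


def associate(active, lost, detections, single_thr=0.3, multi_thr=0.2):
    """Two-stage association (illustrative thresholds, not from the paper).

    active:     dict track_id -> feature from the previous frame (single-shot)
    lost:       dict track_id -> list of past features (multi-shot history)
    detections: list of feature vectors for the current frame
    Returns (dict detection index -> track_id, list of unmatched detection indices).
    """
    dets = dict(enumerate(detections))
    # Stage 1: match active tracks to detections using single-shot features.
    stage1 = greedy_match(active, dets, single_thr)
    remaining = {i: f for i, f in dets.items() if i not in stage1}
    # Stage 2: try to recover lost tracklets using aggregated multi-shot features.
    lost_feats = {tid: mean_feature(h) for tid, h in lost.items() if h}
    stage2 = greedy_match(lost_feats, remaining, multi_thr)
    assignments = {**stage1, **stage2}
    unmatched = [i for i in dets if i not in assignments]
    return assignments, unmatched
```

In this sketch, a detection left unmatched after both stages would typically start a new track; the stricter second-stage threshold reflects the assumption that recovering a long-lost identity should require stronger evidence than frame-to-frame matching.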