Multi-object tracking (MOT) is a challenging vision task that aims to detect individual objects within a single frame and associate them across multiple frames. Recent MOT approaches fall into two categories: two-stage tracking-by-detection (TBD) methods and one-stage joint detection and tracking (JDT) methods. Despite their success, both families share common problems, such as harmful global or local inconsistency, a poor trade-off between robustness and model complexity, and a lack of flexibility across different scenes within the same video. In this paper, we propose a simple but robust framework that formulates object detection and association jointly as a consistent denoising diffusion process from paired noise boxes to paired ground-truth boxes. This progressive denoising strategy substantially improves the tracker's effectiveness, enabling it to discriminate between different objects. During training, paired object boxes diffuse from paired ground-truth boxes to a random distribution, and the model learns detection and tracking simultaneously by reversing this noising process. At inference, the model refines a set of randomly generated paired boxes into detection and tracking results through a flexible one-step or multi-step denoising process. Extensive experiments on three widely used MOT benchmarks, MOT17, MOT20, and DanceTrack, demonstrate that our approach achieves performance competitive with current state-of-the-art methods.
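The noising and denoising of paired boxes described above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. Several details are assumptions made only for illustration: each pair is stored as an array of shape (N, 2, 4) holding two normalized (cx, cy, w, h) boxes, the noise follows a cosine schedule, and the reverse pass uses a deterministic DDIM-style update. The callable `predict_x0` is a hypothetical stand-in for the trained network, and all function names are invented here.

```python
import numpy as np


def cosine_alpha_bar(num_steps: int, s: float = 0.008) -> np.ndarray:
    """Cumulative signal rate alpha_bar[t] for t = 0..num_steps (assumed cosine schedule)."""
    t = np.linspace(0.0, 1.0, num_steps + 1)
    f = np.cos((t + s) / (1.0 + s) * np.pi / 2.0) ** 2
    return f / f[0]  # alpha_bar[0] == 1, alpha_bar[num_steps] is close to 0


def q_sample_paired(x0, t, alpha_bar, rng):
    """Forward (noising) process: diffuse paired boxes x0 of shape (N, 2, 4) to step t."""
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps


def denoise_paired(xt, predict_x0, alpha_bar, steps):
    """Reverse process with a caller-chosen number of steps (1 = one-step refinement).

    predict_x0(xt, t) is a hypothetical model call that returns an estimate
    of the clean paired boxes given the noisy boxes and the current step.
    """
    total = len(alpha_bar) - 1
    times = np.linspace(total, 0, steps + 1).round().astype(int)
    for t_cur, t_next in zip(times[:-1], times[1:]):
        x0_hat = predict_x0(xt, t_cur)
        # Noise implied by the current estimate (guarded against division by zero).
        noise_scale = np.sqrt(max(1.0 - alpha_bar[t_cur], 1e-8))
        eps_hat = (xt - np.sqrt(alpha_bar[t_cur]) * x0_hat) / noise_scale
        # Deterministic DDIM-style jump to the next, less noisy step.
        xt = np.sqrt(alpha_bar[t_next]) * x0_hat + np.sqrt(1.0 - alpha_bar[t_next]) * eps_hat
    return xt


# Usage: noise 5 box pairs halfway, then refine random pairs in 4 steps.
rng = np.random.default_rng(0)
alpha_bar = cosine_alpha_bar(1000)
gt_pairs = rng.uniform(0.1, 0.9, size=(5, 2, 4))
noisy_pairs, _ = q_sample_paired(gt_pairs, 500, alpha_bar, rng)
oracle = lambda x, t: gt_pairs  # placeholder for a trained denoiser
refined = denoise_paired(rng.standard_normal(gt_pairs.shape), oracle, alpha_bar, steps=4)
```

Changing `steps` switches between one-step and multi-step refinement without retraining, which is the kind of inference-time flexibility the abstract refers to. The oracle denoiser in the usage lines exists only to make the sketch executable.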