For a long time, the most common paradigm in Multi-Object Tracking was tracking-by-detection (TbD), where objects are first detected and then associated over video frames. For association, most models resorted to motion and appearance cues, e.g., re-identification networks. Recent approaches based on attention propose to learn the cues in a data-driven manner, showing impressive results. In this paper, we ask ourselves whether simple good old TbD methods are also capable of achieving the performance of end-to-end models. To this end, we propose two key ingredients that allow a standard re-identification network to excel at appearance-based tracking. We extensively analyse its failure cases, and show that a combination of our appearance features with a simple motion model leads to strong tracking results. Our tracker generalizes to four public datasets, namely MOT17, MOT20, BDD100k, and DanceTrack, achieving state-of-the-art performance. Code is available at https://github.com/dvl-tum/GHOST.
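To make the association step of tracking-by-detection concrete, the following is a minimal sketch of how an appearance cue and a motion cue could be combined into one cost for matching existing tracks to new detections. It is not the implementation from the repository above. The weighted sum, the `weight` and `max_cost` parameters, the dictionary layout with `box` and `feat` keys, and the function names are all assumptions made for illustration. Greedy matching is used to keep the example dependency-free, whereas trackers commonly solve the assignment with the Hungarian algorithm. Box overlap (IoU) with the last known track box stands in for a motion model.

```python
import math


def cosine_distance(a, b):
    """Appearance cue: 1 - cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm if norm > 0 else 1.0


def iou(box_a, box_b):
    """Overlap of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(tracks, detections, weight=0.5, max_cost=0.7):
    """Match tracks to detections by a weighted appearance + motion cost.

    Assumption: cost = weight * appearance distance
                     + (1 - weight) * (1 - IoU).
    Returns a sorted list of (track_index, detection_index) pairs.
    """
    pairs = []
    for i, t in enumerate(tracks):
        for j, d in enumerate(detections):
            appearance = cosine_distance(t["feat"], d["feat"])
            motion = 1.0 - iou(t["box"], d["box"])
            pairs.append((weight * appearance + (1 - weight) * motion, i, j))

    # Greedy assignment: take the cheapest remaining pair first.
    matches, used_tracks, used_dets = [], set(), set()
    for cost, i, j in sorted(pairs):
        if cost > max_cost:
            break  # all remaining pairs are too dissimilar
        if i in used_tracks or j in used_dets:
            continue
        matches.append((i, j))
        used_tracks.add(i)
        used_dets.add(j)
    return sorted(matches)


if __name__ == "__main__":
    tracks = [
        {"box": (0, 0, 10, 10), "feat": (1.0, 0.0)},
        {"box": (50, 50, 60, 60), "feat": (0.0, 1.0)},
    ]
    detections = [
        {"box": (51, 50, 61, 60), "feat": (0.1, 0.9)},
        {"box": (1, 0, 11, 10), "feat": (0.9, 0.1)},
    ]
    print(associate(tracks, detections))
```

Detections left unmatched would typically start new tracks, and unmatched tracks would be kept inactive for some frames before being dropped. Both policies are outside this sketch.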