Multiple Object Tracking (MOT) aims to find the bounding boxes and identities of target objects across consecutive video frames. While fully supervised MOT methods achieve high accuracy on existing datasets, they generalize poorly to newly acquired datasets or unseen domains. In this work, we are the first to address the MOT problem from a cross-domain point of view, imitating the process of new data acquisition in practice. We then propose a new cross-domain MOT adaptation method that transfers from existing datasets without any pre-defined human knowledge for understanding and modeling objects. The method can also learn and update itself from feedback on the target data. Extensive experiments are conducted on four challenging settings: MOTSynth to MOT17, MOT17 to MOT20, MOT17 to VisDrone, and MOT17 to DanceTrack. The results demonstrate the adaptability of the proposed self-supervised learning strategy. They also show superior performance on the tracking metrics MOTA and IDF1 compared to fully supervised, unsupervised, and self-supervised state-of-the-art methods.