As a video task, Multiple Object Tracking (MOT) is expected to capture the temporal information of targets effectively. However, most existing methods explicitly exploit object features only between adjacent frames and lack the capacity to model long-term temporal information. In this paper, we propose MeMOTR, a long-term memory-augmented Transformer for multi-object tracking. By injecting long-term memory through a customized memory-attention layer, our method makes the track embedding of the same object more stable and distinguishable, which significantly improves the target association ability of our model. Experimental results on DanceTrack show that MeMOTR surpasses the state-of-the-art method by 7.9% and 13.0% on the HOTA and AssA metrics, respectively. Furthermore, our model outperforms other Transformer-based methods in association performance on MOT17 and generalizes well to BDD100K. Code is available at https://github.com/MCG-NJU/MeMOTR.