Multi-Object Tracking (MOT) has been a long-standing challenge in video understanding. A natural and intuitive approach is to split this task into two parts: object detection and association. Most mainstream methods employ meticulously crafted heuristic techniques to maintain trajectory information and compute cost matrices for object matching. Although these methods can achieve notable tracking performance, they often require a series of elaborate handcrafted modifications when facing complicated scenarios. We believe that manually assumed priors limit the method's adaptability and flexibility in learning optimal tracking capabilities from domain-specific data. Therefore, we introduce a new perspective that treats Multiple Object Tracking as an in-context ID prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a simple yet effective method termed MOTIP. Given a set of trajectories carrying ID information, MOTIP directly decodes the ID labels for current detections to accomplish the association process. Without using tailored or sophisticated architectures, our method achieves state-of-the-art results across multiple benchmarks by solely leveraging object-level features as tracking cues. The simplicity and impressive results of MOTIP leave substantial room for future advancements, thereby making it a promising baseline for subsequent research. Our code and checkpoints are released at https://github.com/MCG-NJU/MOTIP.
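To make the in-context ID prediction idea concrete, the following is a minimal, self-contained sketch (not the authors' implementation): each trajectory token pairs an object feature with its ID label, and each new detection attends to these tokens to produce a distribution over known IDs. The single attention step, the feature dimensions, and all tensor values here are illustrative assumptions standing in for MOTIP's learned ID decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def predict_ids(traj_feats, traj_ids, det_feats, num_ids):
    """Decode an ID label for each detection from in-context trajectory tokens.

    traj_feats: (T, D) features of tracked objects (the in-context examples).
    traj_ids:   (T,)   integer ID labels carried by those trajectories.
    det_feats:  (N, D) features of current-frame detections.
    """
    D = traj_feats.shape[1]
    # One-hot ID labels attached to each trajectory token.
    id_onehot = np.eye(num_ids)[traj_ids]                    # (T, num_ids)
    # Detections attend to trajectory tokens (stand-in for a learned decoder).
    attn = softmax(det_feats @ traj_feats.T / np.sqrt(D))    # (N, T)
    # Aggregating the attended ID labels yields per-detection ID distributions.
    id_probs = attn @ id_onehot                              # (N, num_ids)
    return id_probs.argmax(axis=-1)

# Toy example with well-separated features: detections 0 and 1 correspond
# to trajectories 1 and 2, which carry IDs 0 and 3 respectively.
traj_feats = 5.0 * np.eye(3, 16)          # three trajectories, D = 16
traj_ids = np.array([2, 0, 3])
det_feats = traj_feats[[1, 2]]            # two detections matching rows 1, 2
print(predict_ids(traj_feats, traj_ids, det_feats, num_ids=4))  # → [0 3]
```

In the actual method the ID labels are learnable embeddings and the decoder is trained end-to-end, so no handcrafted cost matrix or matching heuristic is needed; this sketch only illustrates why supplying ID information in context lets association reduce to a classification step.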