Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms rely heavily on explicit visual-textual matching and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this limitation, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that elevates tracking to a cognitive level by requiring models to infer and track specific targets that satisfy implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark featuring a tailored metric suite and a large-scale dataset. The dataset comprises 1,156 language instructions, 423,359 image-language pairs, and 869 distinct video sequences, systematically categorized into six evaluation scenarios, with over 75% of the instructions dedicated to High Level Reasoning. Furthermore, recognizing that traditional trackers lack cognitive capacity while directly applying Large Vision-Language Models (LVLMs) yields severe temporal inconsistencies, we propose ReaTrack. Motivated by the insight that high-level cognitive localization can be decoupled from low-level physical motion continuity, this training-free framework dynamically aligns the semantic detections of a Thinking-variant LVLM with the robust motion priors of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate that ReaTrack establishes a new leading performance standard. Notably, it achieves a more than threefold improvement in RHOTA on the High Level Reasoning subset. Our dataset and code will be available at https://github.com/chen-si-jia/ReaMOT.
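The abstract does not specify how ReaTrack aligns LVLM detections with SAM2 motion priors. As a generic illustration of what aligning per-frame detections with propagated tracks can look like, the following is a minimal sketch that assumes boxes in (x1, y1, x2, y2) format and uses greedy IoU matching. The function names `iou` and `associate` and the 0.5 threshold are hypothetical and are not taken from the ReaMOT code.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed format


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    iou_threshold: float = 0.5,  # hypothetical value, not from the paper
) -> Tuple[Dict[int, int], List[int]]:
    """Greedily match propagated track boxes to new detections by IoU.

    Returns (matches, unmatched), where matches maps track id to detection
    index and unmatched lists detection indices that matched no track.
    """
    # Score every (track, detection) pair and visit the best pairs first.
    pairs = sorted(
        (
            (iou(track_box, det_box), track_id, det_idx)
            for track_id, track_box in tracks.items()
            for det_idx, det_box in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, track_id, det_idx in pairs:
        if score < iou_threshold:
            break  # all remaining pairs overlap too little
        if track_id in matches or det_idx in used:
            continue  # each track and each detection is matched at most once
        matches[track_id] = det_idx
        used.add(det_idx)
    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched
```

In a sketch like this, `tracks` would hold boxes propagated by a motion-based tracker and `detections` would hold boxes proposed by a language-conditioned detector for the current frame. Unmatched detections could seed new tracks, although the actual policy used by ReaTrack is not stated in the abstract.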