Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms are largely designed for explicit instructions and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that requires models to identify and track targets satisfying implicit constraints via logical reasoning. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark comprising: (1) a large-scale dataset with 1,156 instructions categorized into High-Level Reasoning and Low-Level Perception, covering 423,359 image-language pairs across 869 diverse scenes; and (2) a tailored metric suite designed to jointly evaluate reasoning accuracy and tracking robustness. Furthermore, we propose ReaTrack, a training-free framework that synergizes the reasoning capabilities of a Thinking-variant Large Vision-Language Model (LVLM) with the precise temporal modeling of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate the effectiveness of our ReaTrack framework.