ReaMOT: A Benchmark and Framework for Reasoning-based Multi-Object Tracking

Referring Multi-Object Tracking (RMOT) aims to track targets specified by language instructions. However, existing RMOT paradigms rely heavily on explicit visual-textual matching and consequently fail to generalize to complex instructions that require logical reasoning. To overcome this, we propose Reasoning-based Multi-Object Tracking (ReaMOT), a novel task that elevates tracking to a cognitive level: models must infer, through logical reasoning, which targets satisfy implicit constraints, and then track them. To advance this field, we construct the ReaMOT Challenge, a comprehensive benchmark featuring a tailored metric suite and a large-scale dataset. The dataset comprises 1,156 language instructions, 423,359 image-language pairs, and 869 distinct video sequences, systematically categorized into six evaluation scenarios, with over 75% of the instructions dedicated to High-Level Reasoning. Furthermore, recognizing that traditional trackers lack cognitive capacity while directly applying Large Vision-Language Models (LVLMs) yields severe temporal inconsistencies, we propose ReaTrack. Built on the insight that high-level cognitive localization can be decoupled from low-level physical motion continuity, this training-free framework dynamically aligns the semantic detections of a Thinking-variant LVLM with the robust motion priors of SAM2. Extensive experiments on the ReaMOT Challenge benchmark demonstrate that ReaTrack sets a new performance standard. Notably, it achieves a more than threefold improvement in RHOTA on the High-Level Reasoning subset. Our dataset and code will be available at https://github.com/chen-si-jia/ReaMOT.
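As a rough illustration of the decoupling idea in the abstract, the sketch below associates per-frame boxes from a semantic detector with boxes propagated by a motion-based tracker, using greedy IoU matching. This is an assumption made for illustration only: the abstract does not specify how ReaTrack performs its alignment, and the names `iou`, `associate`, and `iou_threshold` are hypothetical and do not come from the ReaMOT code, an LVLM API, or SAM2.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box],
    detections: List[Box],
    iou_threshold: float = 0.5,  # hypothetical default, not from the paper
) -> Tuple[Dict[int, int], List[int]]:
    """Greedy one-to-one matching of propagated track boxes to detections.

    tracks:     track id -> box propagated by the motion model (assumed input)
    detections: boxes returned by the semantic detector for the same frame
    Returns (track id -> detection index, indices of unmatched detections).
    """
    # Score every track/detection pair and visit the best overlaps first.
    pairs = sorted(
        (
            (iou(track_box, det_box), track_id, det_idx)
            for track_id, track_box in tracks.items()
            for det_idx, det_box in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used = set()
    for score, track_id, det_idx in pairs:
        if score < iou_threshold:
            break  # all remaining pairs overlap even less
        if track_id in matches or det_idx in used:
            continue
        matches[track_id] = det_idx
        used.add(det_idx)
    unmatched = [i for i in range(len(detections)) if i not in used]
    return matches, unmatched
```

In a sketch like this, matched detections would confirm that an existing track still satisfies the instruction, while unmatched detections would be candidates for starting new tracks. How ReaTrack actually handles these cases is not stated in the abstract.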

