Recent research has made substantial progress on video reasoning, with many models leveraging spatio-temporal evidence chains to strengthen their inference capabilities. At the same time, a growing set of datasets and benchmarks now provides structured annotations designed to support and evaluate such reasoning. However, little attention has been paid to reasoning about \emph{how} objects move between observations: no prior work articulates motion patterns by connecting successive observations, leaving trajectory understanding implicit and difficult to verify. We formalize this missing capability as Spatial-Temporal-Trajectory (STT) reasoning and introduce \textbf{Motion-o}, a motion-centric video understanding extension to vision-language models that makes trajectories explicit and verifiable. To enable motion reasoning, we also introduce a trajectory-grounding dataset that expands sparse keyframe supervision via augmentation into denser bounding-box tracks, yielding a stronger trajectory-level training signal. Finally, we introduce Motion Chain of Thought (MCoT), a structured reasoning pathway that expresses object trajectories through discrete \texttt{<motion/>} tags, each summarizing a per-object direction, speed, and velocity-scale change, thereby explicitly connecting grounded observations into trajectories. To train Motion-o, we design a reward function that compels the model to reason directly over visual evidence, all without requiring architectural modifications. Empirical results demonstrate that Motion-o improves spatial-temporal grounding and trajectory prediction while remaining fully compatible with existing frameworks, establishing motion reasoning as a critical extension for evidence-based video understanding. Code is available at https://github.com/ostadabbas/Motion-o.
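To make the dataset construction concrete, the following is a minimal sketch, assuming the augmentation densifies sparse keyframe boxes by linear interpolation between annotated frames; the paper's actual densification scheme may differ, and the function name \texttt{densify\_track} is hypothetical:

\begin{verbatim}
# Illustrative sketch (assumption: linear interpolation): densify sparse
# keyframe boxes into a per-frame track, giving trajectory-level supervision.

def densify_track(keyframes):
    """keyframes: sorted list of (frame_index, (x1, y1, x2, y2)).
    Returns {frame_index: box} for every frame from first to last key."""
    track = {}
    for (fa, box_a), (fb, box_b) in zip(keyframes, keyframes[1:]):
        for f in range(fa, fb + 1):
            t = (f - fa) / (fb - fa)
            # Interpolate each box coordinate independently.
            track[f] = tuple((1 - t) * a + t * b
                             for a, b in zip(box_a, box_b))
    return track

sparse = [(0, (10, 10, 50, 50)), (10, (110, 10, 150, 50))]
dense = densify_track(sparse)
print(dense[5])  # -> (60.0, 10.0, 100.0, 50.0)
\end{verbatim}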
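To illustrate the \texttt{<motion/>} abstraction, the sketch below derives such a tag from two grounded bounding boxes; the attribute names (\texttt{dir}, \texttt{speed}, \texttt{scale}) and the acceleration thresholds are illustrative assumptions, not the paper's verbatim schema:

\begin{verbatim}
# Illustrative sketch: summarize displacement between two grounded boxes
# as a discrete <motion/> tag (direction, speed, velocity-scale change).
import math

def motion_tag(box_a, box_b, dt, prev_speed=None):
    """Boxes are (x1, y1, x2, y2); dt is the number of elapsed frames."""
    cax, cay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    cbx, cby = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2
    dx, dy = cbx - cax, cby - cay
    speed = math.hypot(dx, dy) / dt                  # pixels per frame
    angle = math.degrees(math.atan2(-dy, dx)) % 360  # image y-axis points down
    direction = ["right", "up-right", "up", "up-left",
                 "left", "down-left", "down", "down-right"][round(angle / 45) % 8]
    if prev_speed is None:
        scale = "steady"
    elif speed > 1.2 * prev_speed:
        scale = "accelerating"
    elif speed < 0.8 * prev_speed:
        scale = "decelerating"
    else:
        scale = "steady"
    return f'<motion dir="{direction}" speed="{speed:.1f}" scale="{scale}"/>'

print(motion_tag((340, 210, 402, 395), (398, 208, 460, 390), dt=6))
# -> <motion dir="right" speed="9.7" scale="steady"/>
\end{verbatim}

In an MCoT trace, a sequence of such tags would sit between the grounded per-frame observations and the final answer, making the inferred trajectory explicit and checkable against the box evidence.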