We present ReMoT, a unified training paradigm that systematically addresses the shortcomings of vision-language models (VLMs) in spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, surpassing costly manual or model-based generation; and (2) Group Relative Policy Optimization (GRPO), which we empirically find yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning (SFT). We also construct the first benchmark of fine-grained motion-contrast triplets, which measures a VLM's ability to discriminate subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, including a 25.1% performance gain on spatio-temporal reasoning tasks.
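As a rough illustration of the group-relative idea behind Group Relative Policy Optimization, the sketch below scores several sampled answers to one motion-contrast question and normalizes each reward against its own group. It is a minimal sketch under stated assumptions, not the training code used for ReMoT: the exact-match reward, the answer strings, and the function names `rule_based_reward` and `group_relative_advantages` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def rule_based_reward(predicted: str, gold: str) -> float:
    """Hypothetical exact-match reward: 1.0 if the answer matches, else 0.0."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """Normalize each reward by the mean and std of its own sample group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# One question ("which way does the object move?") with four sampled answers.
gold = "left"  # assumed ground-truth label taken from video meta-annotations
samples = ["left", "right", "Left ", "right"]

rewards = [rule_based_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)

for answer, reward, adv in zip(samples, rewards, advantages):
    print(f"{answer!r:10} reward={reward:.1f} advantage={adv:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative advantages relative to the group, so no separate value model is needed. When every sample in a group earns the same reward, all advantages are zero and that group contributes no learning signal.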