We present ReMoT, a unified training paradigm that systematically addresses a fundamental shortcoming of vision-language models (VLMs): poor spatio-temporal consistency, a critical failure point in navigation, robotics, and autonomous driving. ReMoT integrates two core components: (1) a rule-based automatic framework that generates ReMoT-16K, a large-scale motion-contrast dataset of 16.5K triplets derived from video meta-annotations, which surpasses costly manual or model-based generation; and (2) Group Relative Policy Optimization (GRPO), which we empirically show yields the best performance and data efficiency for learning this contrastive reasoning, far exceeding standard Supervised Fine-Tuning (SFT). We also construct the first benchmark of fine-grained motion-contrast triplets, which measures a VLM's ability to discriminate subtle motion attributes (e.g., opposing directions). The resulting model achieves state-of-the-art performance on our new benchmark and on multiple standard VLM benchmarks, with a 25.1% performance gain on spatio-temporal reasoning tasks.
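To make the second component concrete, the following is a minimal sketch of the group-relative advantage at the core of Group Relative Policy Optimization: rewards for several responses sampled for the same prompt are normalized by that group's mean and standard deviation, rather than by a learned value function. The function name, the `eps` term, and the binary reward (1.0 when a sampled answer picks the correct motion, 0.0 otherwise) are assumptions made for illustration; the abstract does not specify the reward design or training details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar per response sampled for the same prompt.
    `eps` (an assumption of this sketch) avoids division by zero when
    every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four sampled answers to one motion-contrast question:
# 1.0 = chose the correct motion description, 0.0 = chose the contrasting one.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Correct answers receive a positive advantage, incorrect ones a negative one.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

When all responses in a group earn the same reward, every advantage is zero, so that group contributes no learning signal in this sketch.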