Humans learn not only how their bodies move, but also how the surrounding world responds to their actions. In contrast, while recent Vision-Language-Action (VLA) models exhibit impressive semantic understanding, they often fail to capture the spatiotemporal dynamics governing physical interaction. In this paper, we introduce Pri4R, a simple yet effective approach that endows VLA models with an implicit understanding of world dynamics by leveraging privileged 4D information during training. Specifically, Pri4R augments VLAs with a lightweight head that predicts 3D point tracks. Injecting VLA features into this head to jointly predict future 3D trajectories encourages the model to encode evolving scene geometry in its shared representation space, providing more physically aware context for precise control. Owing to its architectural simplicity, Pri4R fits dominant VLA design patterns with minimal changes. During inference, we run the model using the original VLA architecture unchanged; Pri4R adds no extra inputs, outputs, or computational overhead. Across simulation and real-world evaluations, Pri4R significantly improves performance on challenging manipulation tasks, including a +10% gain on LIBERO-Long and a +40% gain on RoboCasa. We further show that 3D point track prediction is an effective supervision target for learning action-world dynamics, and validate our design choices through extensive ablations.
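The training-time design described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimension, horizon, number of tracked points, the two-layer MLP head, and the L2 track loss are all assumptions for exposition, written in plain numpy.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions (all hypothetical, for illustration only)
B, D = 2, 512   # batch size, VLA backbone feature dimension
T, N = 8, 32    # future horizon steps, number of tracked 3D points
H = 256         # hidden width of the lightweight track head

# Lightweight two-layer MLP head: D -> H -> T*N*3
W1 = rng.standard_normal((D, H)) * 0.02
b1 = np.zeros(H)
W2 = rng.standard_normal((H, T * N * 3)) * 0.02
b2 = np.zeros(T * N * 3)

def track_head(vla_features: np.ndarray) -> np.ndarray:
    """Map shared VLA features to predicted future 3D point tracks (B, T, N, 3)."""
    h = np.maximum(vla_features @ W1 + b1, 0.0)  # ReLU
    return (h @ W2 + b2).reshape(-1, T, N, 3)

# During training, an auxiliary loss against privileged 4D point tracks is
# added to the usual action objective; at inference the head is simply
# dropped, leaving the original VLA unchanged.
features = rng.standard_normal((B, D))          # stand-in for VLA features
pred_tracks = track_head(features)
gt_tracks = rng.standard_normal((B, T, N, 3))   # stand-in for privileged tracks
track_loss = np.mean((pred_tracks - gt_tracks) ** 2)
print(pred_tracks.shape)
```

Because the head reads the VLA's own features, the gradient of the track loss flows back into the shared representation, which is how the geometry signal reaches the policy; the head itself never runs at deployment.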