Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the embeddings of the three modalities to span a small simplex volume in a shared hyperspherical space, where a smaller volume indicates stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
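To make the geometric ambiguity of naive volume minimization concrete, the following is a minimal sketch rather than the DynaFLIP objective itself. It assumes the simplex is the one formed by the origin and the three unit-normalized embeddings, so its volume can be computed from the Gram determinant; the function names, the `cos_weight` parameter, and the form of the cosine term are hypothetical, and the contrastive objective is omitted.

```python
import math


def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def pairwise_cosines(img, txt, flow):
    """Cosine similarities between the three unit-normalized embeddings."""
    a, b, c = normalize(img), normalize(txt), normalize(flow)
    return dot(a, b), dot(a, c), dot(b, c)


def simplex_volume(img, txt, flow):
    """Volume of the simplex with vertices at the origin and the three
    unit-normalized embeddings (assumed definition): sqrt(det G) / 3!,
    where G is the 3x3 Gram matrix of the embeddings."""
    ab, ac, bc = pairwise_cosines(img, txt, flow)
    # Determinant of [[1, ab, ac], [ab, 1, bc], [ac, bc, 1]].
    det = 1.0 + 2.0 * ab * ac * bc - ab * ab - ac * ac - bc * bc
    return math.sqrt(max(det, 0.0)) / 6.0  # clamp tiny negative round-off


def alignment_loss(img, txt, flow, cos_weight=1.0):
    """Hypothetical combination: volume term plus a cosine regularizer
    that penalizes low mean pairwise similarity."""
    mean_cos = sum(pairwise_cosines(img, txt, flow)) / 3.0
    return simplex_volume(img, txt, flow) + cos_weight * (1.0 - mean_cos)


# Three identical directions: zero volume and fully aligned.
aligned = ([1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0])

# Three coplanar directions 120 degrees apart: also zero volume,
# yet every pair has cosine -0.5, so they are far from aligned.
s = math.sqrt(3.0) / 2.0
coplanar = ([1.0, 0.0, 0.0], [-0.5, s, 0.0], [-0.5, -s, 0.0])

# Mutually orthogonal directions: the largest volume in this family.
orthogonal = ([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0])

for name, triplet in [("aligned", aligned), ("coplanar", coplanar),
                      ("orthogonal", orthogonal)]:
    print(name, simplex_volume(*triplet), alignment_loss(*triplet))
```

In this sketch the coplanar triplet has (numerically) the same zero volume as the aligned one, so the volume term alone cannot tell them apart; the added cosine term separates the two cases, which is one way to read the motivation for pairing volume minimization with a cosine regularizer.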