We introduce \textbf{LaMP}, a dual-expert Vision-Language-Action (VLA) framework that embeds dense 3D scene flow as a latent motion prior for robotic manipulation. Existing VLA models regress actions directly from 2D semantic visual features, which forces them to learn complex 3D physical interactions implicitly, and this implicit strategy degrades under unfamiliar spatial dynamics. LaMP addresses this limitation by aligning a flow-matching \emph{Motion Expert} with a policy-predicting \emph{Action Expert} through gated cross-attention. Specifically, the Motion Expert generates a one-step, partially denoised 3D scene flow, and its hidden states condition the Action Expert, so no full multi-step reconstruction is required. We evaluate LaMP on the LIBERO, LIBERO-Plus, and SimplerEnv-WidowX simulation benchmarks as well as in real-world experiments. Across the three simulation benchmarks, LaMP consistently outperforms the evaluated VLA baselines, achieving the highest reported average success rates under the same training budgets. Under LIBERO-Plus out-of-distribution (OOD) perturbations, LaMP is more robust, with an average 9.7\% gain over the strongest prior baseline. Our project page is available at https://summerwxk.github.io/lamp-project-page/.
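As a rough illustration of the conditioning described in the abstract, the following sketch writes out one possible form of the one-step partial denoising and the gated cross-attention. All symbols here ($x_0$, $v_\theta$, $c$, $\Delta t$, $H_m$, $H_a$, $W_Q$, $W_K$, $W_V$, $g$, $d$) are assumed notation introduced for this sketch and do not come from the source. The $\tanh$ gate and the convention that noise sits at $t=0$ are likewise assumptions, so the actual LaMP formulation may differ.

Assume the Motion Expert parameterizes a velocity field $v_\theta$ conditioned on vision-language context $c$. Starting from noise $x_0 \sim \mathcal{N}(0, I)$, a single Euler step of size $\Delta t \in (0, 1]$ gives a partially denoised scene flow:
\[
x_{\Delta t} = x_0 + \Delta t \, v_\theta(x_0, 0, c).
\]
Let $H_m$ denote the Motion Expert's hidden states from this single forward pass and $H_a$ the Action Expert's token states. A gated cross-attention update could then take the form
\[
Q = H_a W_Q, \qquad K = H_m W_K, \qquad V = H_m W_V,
\]
\[
H_a' = H_a + \tanh(g) \, \mathrm{softmax}\!\left( \frac{Q K^{\top}}{\sqrt{d}} \right) V,
\]
where $d$ is the key dimension and $g$ is a learnable scalar gate. Under this assumed form, initializing $g = 0$ leaves the Action Expert unchanged at the start of training, and only $H_m$, not a fully integrated flow $x_1$, is needed at inference time.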