DIAL: Decoupling Intent and Action via Latent World Modeling for End-to-End VLA

The development of Vision-Language-Action (VLA) models has been significantly accelerated by pre-trained Vision-Language Models (VLMs). However, most existing end-to-end VLAs treat the VLM primarily as a multimodal encoder, directly mapping vision-language features to low-level actions. This paradigm underutilizes the VLM's potential in high-level decision making and introduces training instability, frequently degrading its rich semantic representations. To address these limitations, we introduce DIAL, a framework bridging high-level decision making and low-level motor execution through a differentiable latent intent bottleneck. Specifically, a VLM-based System-2 performs latent world modeling by synthesizing latent visual foresight within the VLM's native feature space; this foresight explicitly encodes intent and serves as the structural bottleneck. A lightweight System-1 policy then decodes this predicted intent together with the current observation into precise robot actions via latent inverse dynamics. To ensure optimization stability, we employ a two-stage training paradigm: a decoupled warmup phase where System-2 learns to predict latent futures while System-1 learns motor control under ground-truth future guidance within a unified feature space, followed by seamless end-to-end joint optimization. This enables action-aware gradients to refine the VLM backbone in a controlled manner, preserving pre-trained knowledge. Extensive experiments on the RoboCasa GR1 Tabletop benchmark show that DIAL establishes a new state-of-the-art, achieving superior performance with 10x fewer demonstrations than prior methods. Furthermore, by leveraging heterogeneous human demonstrations, DIAL learns physically grounded manipulation priors and exhibits robust zero-shot generalization to unseen objects and novel configurations during real-world deployment on a humanoid robot.
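The two-stage schedule described above can be illustrated with a deliberately tiny toy. This is a sketch under strong assumptions and not the authors' implementation: the latents are scalars standing in for VLM features, both systems are linear, the gradients are written by hand, and every name (`System2`, `System1`, `warmup_step`, `joint_step`, `wm_weight`) is hypothetical. It shows only the control flow: a decoupled warmup in which System-1 is fed the ground-truth future latent, followed by joint training in which the action loss backpropagates through the predicted latent into System-2.

```python
import random


def make_sample(rng):
    """Toy data; scalar 'latents' stand in for VLM features (assumption)."""
    z_t = rng.uniform(-1.0, 1.0)        # latent of the current observation
    goal = rng.uniform(-1.0, 1.0)       # stand-in for the instruction embedding
    z_future = 0.5 * z_t + 0.5 * goal   # ground-truth future latent
    action = z_future - z_t             # action that causes the transition
    return z_t, goal, z_future, action


class System2:
    """Latent world model: predicts the future latent (the 'intent')."""

    def __init__(self):
        self.w = [0.0, 0.0]

    def predict(self, z_t, goal):
        return self.w[0] * z_t + self.w[1] * goal


class System1:
    """Latent inverse dynamics: (current latent, future latent) -> action."""

    def __init__(self):
        self.u = [0.0, 0.0]

    def act(self, z_t, z_future):
        return self.u[0] * z_t + self.u[1] * z_future


def warmup_step(s2, s1, sample, lr):
    """Stage 1: two independent squared-error losses, no gradient between them."""
    z_t, goal, z_future, action = sample
    err_wm = s2.predict(z_t, goal) - z_future
    s2.w[0] -= lr * 2.0 * err_wm * z_t
    s2.w[1] -= lr * 2.0 * err_wm * goal
    # System-1 is conditioned on the ground-truth future, not the prediction.
    err_act = s1.act(z_t, z_future) - action
    s1.u[0] -= lr * 2.0 * err_act * z_t
    s1.u[1] -= lr * 2.0 * err_act * z_future


def joint_step(s2, s1, sample, lr, wm_weight=1.0):
    """Stage 2: the action loss flows through the predicted latent into System-2."""
    z_t, goal, z_future, action = sample
    z_hat = s2.predict(z_t, goal)
    err_wm = z_hat - z_future
    err_act = s1.act(z_t, z_hat) - action
    # d(total loss)/d(z_hat): action-aware term plus world-model term.
    d_z_hat = 2.0 * err_act * s1.u[1] + wm_weight * 2.0 * err_wm
    grad_u = (2.0 * err_act * z_t, 2.0 * err_act * z_hat)
    grad_w = (d_z_hat * z_t, d_z_hat * goal)
    for i in range(2):
        s1.u[i] -= lr * grad_u[i]
        s2.w[i] -= lr * grad_w[i]


def action_mse(s2, s1, samples):
    """Inference path: System-2 predicts the intent, System-1 decodes it."""
    total = 0.0
    for z_t, goal, _, action in samples:
        total += (s1.act(z_t, s2.predict(z_t, goal)) - action) ** 2
    return total / len(samples)


def train(seed=0, warmup_steps=2000, joint_steps=2000):
    rng = random.Random(seed)
    s2, s1 = System2(), System1()
    for _ in range(warmup_steps):
        warmup_step(s2, s1, make_sample(rng), lr=0.1)
    for _ in range(joint_steps):
        joint_step(s2, s1, make_sample(rng), lr=0.05)
    return s2, s1


if __name__ == "__main__":
    s2, s1 = train()
    eval_rng = random.Random(123)
    held_out = [make_sample(eval_rng) for _ in range(200)]
    print("held-out action MSE:", action_mse(s2, s1, held_out))
```

In this toy the world-model term is kept during joint training (`wm_weight`) so that the predicted latent stays anchored to the true future while the action loss refines it. Whether and how DIAL weights such a term is not stated in the abstract, so treat that choice as an assumption of the sketch.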
