Vision-Language-Action models (VLAs) leverage large-scale vision-language pretraining for semantic robot control, but often lack explicit foresight into how robot actions change the scene. World-Action Models (WAMs) address this limitation by conditioning policies on predicted futures, yet existing approaches typically rely on computationally expensive video generation with substantial pixel-level redundancy. We present LaWAM, a Latent World Action Model that exposes predictive dynamics to robot policies through compact latent visual subgoals instead of reconstructed future video. At the core of LaWAM is a latent-action-conditioned Latent World Model (LaWM). We obtain LaWM by training a latent action model in the latent space of a pretrained vision foundation model and repurposing its forward decoder to predict future observation features for scene evolution. LaWAM then conditions action generation on these predicted latent visual subgoals to enable dynamics-aware robot control. LaWAM achieves state-of-the-art or competitive success rates (SRs) across LIBERO (98.6% SR), RoboTwin (91.22% SR), and real-world manipulation tasks while retaining low-latency inference. LaWAM runs in 187 ms per action-chunk prediction and achieves up to 24x lower wall-clock latency than pixel-space WAMs.
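The data flow described above (frozen visual features, a latent-action-conditioned forward decoder that predicts future features, and an action head conditioned on that predicted subgoal) can be illustrated with a toy sketch. This is not the authors' implementation. Every name, dimension, and weight below is hypothetical: random linear maps stand in for trained networks, the residual form of the feature prediction is an assumption, and the latent action is a placeholder vector because the abstract does not state how it is obtained at inference time. The point is only that no pixels are reconstructed anywhere in the loop.

```python
import random

# Toy sizes chosen for illustration only; none of them come from the paper.
PIXELS = 16                 # flattened stand-in "image"
FEAT_DIM = 8                # feature size of the stand-in vision encoder
LATENT_ACTION_DIM = 4       # size of the latent action code
ACTION_DIM, CHUNK_LEN = 7, 5  # assumed robot command size and chunk length


def make_linear(in_dim, out_dim, rng):
    """Random weight matrix standing in for a trained network (hypothetical)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]


def apply_linear(weights, x):
    """Plain matrix-vector product on Python lists."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def run_step(seed=0):
    """One control step: encode, predict a latent subgoal, emit an action chunk."""
    rng = random.Random(seed)

    # Stand-ins for the three learned components named in the text.
    encoder = make_linear(PIXELS, FEAT_DIM, rng)  # frozen vision model
    forward_decoder = make_linear(FEAT_DIM + LATENT_ACTION_DIM, FEAT_DIM, rng)
    action_head = make_linear(2 * FEAT_DIM, ACTION_DIM * CHUNK_LEN, rng)

    image = [rng.random() for _ in range(PIXELS)]
    # Placeholder latent action; its real source at inference is not specified.
    latent_action = [rng.gauss(0.0, 1.0) for _ in range(LATENT_ACTION_DIM)]

    # 1) Current observation features.
    z_t = apply_linear(encoder, image)

    # 2) Predicted future features (the latent visual subgoal).
    #    The residual update z_t + delta is an assumption of this sketch.
    delta = apply_linear(forward_decoder, z_t + latent_action)  # list concat
    z_goal = [z + d for z, d in zip(z_t, delta)]

    # 3) Action chunk conditioned on current features and the subgoal.
    flat = apply_linear(action_head, z_t + z_goal)
    chunk = [flat[i * ACTION_DIM:(i + 1) * ACTION_DIM] for i in range(CHUNK_LEN)]
    return z_goal, chunk


if __name__ == "__main__":
    z_goal, chunk = run_step(seed=0)
    print(len(z_goal), len(chunk), len(chunk[0]))
```

In this sketch the subgoal is a single feature vector of length `FEAT_DIM`, so the cost of the predictive step scales with the feature size rather than with image resolution, which is the intuition behind replacing video generation with latent prediction.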