World action models (WAMs) have emerged as a promising direction for robot policy learning because they can leverage powerful video backbones to model future states. However, existing approaches often rely on separate action modules or use action representations that are not pixel-grounded, which makes it difficult to fully exploit the pretrained knowledge of video models and limits transfer across viewpoints and environments. In this work, we present Action Images, a unified world action model that formulates policy learning as multi-view video generation. Instead of encoding control as low-dimensional tokens, we translate 7-DoF robot actions into interpretable action images: multi-view action videos that are grounded in 2D pixels and explicitly track robot-arm motion. This pixel-grounded action representation allows the video backbone itself to act as a zero-shot policy, without a separate policy head or action module. Beyond control, the same unified model supports video-action joint generation, action-conditioned video generation, and action labeling under a shared representation. On RLBench and in real-world evaluations, our model achieves the strongest zero-shot success rates and improves video-action joint generation quality over prior video-space world models, suggesting that interpretable action images are a promising route to policy learning.
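To make the idea of a pixel-grounded action concrete, the following is a minimal sketch of one way an end-effector action could be rendered into per-view images. It is not the encoding used in this work: the pinhole projection, the Gaussian-blob rendering, the intensity-based gripper encoding, and the names `project_point`, `render_action_image`, and `action_to_images` are all assumptions made for illustration, and the rotational part of the 7-DoF action is omitted.

```python
import numpy as np


def project_point(point_world, K, T_cam_world):
    """Project a 3D world point to pixel coordinates (u, v) with a pinhole model.

    K is a 3x3 intrinsics matrix; T_cam_world is a 4x4 world-to-camera transform.
    """
    p_h = np.append(np.asarray(point_world, dtype=np.float64), 1.0)
    p_cam = (T_cam_world @ p_h)[:3]
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]


def render_action_image(uv, gripper_open, height, width, sigma=3.0):
    """Draw a Gaussian blob centered at uv.

    Hypothetical encoding: blob intensity carries the gripper state
    (1.0 when open, 0.5 when closed).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    sq_dist = (xs - uv[0]) ** 2 + (ys - uv[1]) ** 2
    blob = np.exp(-sq_dist / (2.0 * sigma ** 2))
    scale = 1.0 if gripper_open else 0.5
    return (scale * blob).astype(np.float32)


def action_to_images(position, gripper_open, cameras, height=64, width=64):
    """Render one action image per camera; cameras is a list of (K, T_cam_world)."""
    return [
        render_action_image(project_point(position, K, T), gripper_open, height, width)
        for K, T in cameras
    ]


# Two illustrative cameras sharing intrinsics; the second is shifted along x.
K = np.array([[50.0, 0.0, 32.0], [0.0, 50.0, 32.0], [0.0, 0.0, 1.0]])
T_front = np.eye(4)
T_side = np.eye(4)
T_side[0, 3] = 0.1

images = action_to_images(np.array([0.0, 0.0, 1.0]), True, [(K, T_front), (K, T_side)])
```

Stacking such images over time would give one action video per view, which is the kind of input and output a video backbone can handle without a separate action head.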