Recent World-Action (WA) models, which pair a world model with an actor, demonstrate strong generalization and data efficiency, but they are typically trained on expert trajectories. This reliance limits their ability to acquire fine-grained manipulation skills beyond the demonstration distribution and prevents them from improving continuously through real-world interaction. To address these limitations, we propose WAM-RL, a reinforcement learning (RL) framework that jointly optimizes the world model and the actor through online interaction with the environment. Allowing the two components to co-evolve enhances fine-grained control and adaptability. We design a tailored RL method with hierarchical optimization to coordinate their improvement. On the methodological side, we systematically investigate the effects of applying RL to the actor and of training the world model online within an RL setting. Our experiments reveal a key insight: optimizing only the actor improves performance on short-horizon tasks but yields no significant gains on long-horizon tasks, whereas jointly optimizing the world model and the actor is critical for strong long-horizon performance. Our work is the first to introduce reinforcement learning into the World-Action paradigm, and it provides insights into how online optimization of the actor and the world model affects overall performance.
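As a purely illustrative aid, the idea of co-evolving a world model and an actor from online interaction can be sketched on a toy problem. The sketch below is not the WAM-RL algorithm, whose details are not given in this abstract. It assumes a hypothetical 1-D linear environment, a linear world model (`a_hat`, `b_hat`), a linear actor gain `k`, and a simple two-level schedule in which the world model is fit on every transition and the actor is improved once per episode by differentiating through the learned model. All names, dynamics, and hyperparameters are assumptions chosen for illustration.

```python
import random

# Toy 1-D environment (hypothetical, for illustration only): s' = A*s + B*u.
A_TRUE, B_TRUE = 0.9, 0.5


def env_step(s, u):
    """Ground-truth dynamics; the reward penalizes distance from the origin."""
    s_next = A_TRUE * s + B_TRUE * u
    return s_next, -s_next ** 2


def train(episodes=2000, horizon=5, lr_model=0.05, lr_actor=0.1, seed=0):
    """Alternate online world-model fitting and model-based actor updates."""
    rng = random.Random(seed)
    a_hat, b_hat = 0.0, 0.0  # world model: s' is approximated by a_hat*s + b_hat*u
    k = 0.0                  # actor: u = -k*s
    for _episode in range(episodes):
        s = rng.uniform(-1.0, 1.0)
        visited = []
        for _t in range(horizon):
            # Act in the real environment with uniform exploration noise.
            u = -k * s + rng.uniform(-1.0, 1.0)
            s_next, _reward = env_step(s, u)
            # Inner level: one SGD step on the squared prediction error.
            err = (a_hat * s + b_hat * u) - s_next
            a_hat -= lr_model * err * s
            b_hat -= lr_model * err * u
            visited.append(s)
            s = s_next
        # Outer level: improve the actor through the learned world model by
        # descending the predicted squared next state over visited states.
        grad = 0.0
        for s_v in visited:
            pred = (a_hat - b_hat * k) * s_v     # predicted s' without noise
            grad += 2.0 * pred * (-b_hat * s_v)  # d(pred**2)/dk
        k -= lr_actor * grad / len(visited)
    return a_hat, b_hat, k


if __name__ == "__main__":
    a_hat, b_hat, k = train()
    print(f"a_hat={a_hat:.3f}  b_hat={b_hat:.3f}  k={k:.3f}")
```

In this toy setting the learned parameters should approach the true dynamics (0.9 and 0.5), and the actor gain should approach `A_TRUE / B_TRUE` (1.8), the value that drives the predicted next state to zero. The actor can only be as good as the model it differentiates through, which loosely mirrors the abstract's point that the two components benefit from being improved together.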