Large language models (LLMs) have achieved strong performance in language-centric tasks. However, in agentic settings, LLMs often struggle to anticipate action consequences and adapt to environment dynamics, highlighting the need for world-modeling capabilities in LLM-based agents. We propose Reinforcement World Model Learning (RWML), a self-supervised method that learns action-conditioned world models for LLM-based agents on textual states using sim-to-real gap rewards. Our method aligns simulated next states produced by the model with realized next states observed from the environment, encouraging consistency between internal world simulations and actual environment dynamics in a pre-trained embedding space. Unlike next-state token prediction, which prioritizes token-level fidelity (i.e., reproducing exact wording) over semantic equivalence and can lead to model collapse, our method provides a more robust training signal and is empirically less susceptible to reward hacking than LLM-as-a-judge. We evaluate our method on ALFWorld and $\tau^2$ Bench and observe significant gains over the base model, despite being entirely self-supervised. When combined with task-success rewards, our method outperforms direct task-success reward RL by 6.9 and 5.7 points on ALFWorld and $\tau^2$ Bench, respectively, while matching the performance of expert-data training.
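To make the sim-to-real gap reward concrete, the following is a minimal sketch: it scores a simulated next state against the realized next state by cosine similarity in an embedding space. The abstract specifies a pre-trained embedding space; here a toy bag-of-words embedding stands in for a pre-trained sentence encoder so the snippet stays self-contained, and the function name `sim_to_real_reward` is illustrative, not from the paper.

```python
import math
from collections import Counter


def embed(text: str) -> dict[str, float]:
    # Toy unit-norm bag-of-words embedding. In RWML this would be a
    # pre-trained sentence encoder; this stand-in only illustrates the
    # reward computation, not semantic matching.
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values())) or 1.0
    return {w: c / norm for w, c in counts.items()}


def sim_to_real_reward(simulated_next_state: str, realized_next_state: str) -> float:
    # Cosine similarity between the agent's simulated next state and the
    # realized next state observed from the environment; higher means the
    # internal simulation better matches the environment's dynamics.
    a = embed(simulated_next_state)
    b = embed(realized_next_state)
    return sum(v * b.get(w, 0.0) for w, v in a.items())
```

Unlike next-state token prediction, a reward of this shape scores the whole predicted state rather than forcing exact wording; with a real pre-trained encoder, paraphrases of the realized state would also receive high reward.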