Reinforcement learning (RL) has emerged as a powerful paradigm for training Large Language Models (LLMs) as agents. However, conventional RL methods for long-horizon agentic tasks often struggle with sparse outcome rewards. Intuitively, relying on outcome rewards alone overlooks the rich information about environment dynamics contained in rollout interaction trajectories. We argue that interaction experience inherently serves as an implicit supervision signal: it reveals the underlying transition mechanisms of the environment and enables the agent to construct a more accurate internal model of the environment. In this work, we therefore investigate how to leverage this additional signal to improve policy learning. Specifically, we propose EnvRL, a framework that incorporates environment dynamics learning into agentic RL via two auxiliary objectives: state prediction and inverse dynamics. By optimizing these objectives jointly with the primary RL objective, we encourage the agent to internalize environment dynamics from its own interaction experience. Extensive experiments on two long-horizon agentic benchmarks demonstrate that EnvRL achieves significant improvements in success rate over RL-only baselines. For example, when trained with GRPO, EnvRL lifts Qwen-2.5-1.5B-Instruct from 72.8% to 77.4% on ALFWorld and from 56.8% to 67.0% on WebShop.
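To make the idea of the two auxiliary objectives concrete, the following is a minimal sketch of how state prediction and inverse dynamics targets could be derived from a rollout transition and combined with an RL loss. It is an illustration under stated assumptions, not the EnvRL implementation: the abstract does not specify the prompt format, the loss form, or the weighting, so the `Transition` type, the prompt templates, the averaged negative log-likelihood, and the weights `lambda_sp` and `lambda_inv` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass
class Transition:
    """One rollout step rendered as text (hypothetical container)."""
    state: str
    action: str
    next_state: str


def state_prediction_example(t: Transition) -> Tuple[str, str]:
    """State prediction: condition on (state, action), target is the next state.

    The prompt template is an assumption made for illustration.
    """
    prompt = f"State: {t.state}\nAction: {t.action}\nNext state:"
    return prompt, t.next_state


def inverse_dynamics_example(t: Transition) -> Tuple[str, str]:
    """Inverse dynamics: condition on (state, next state), target is the action.

    The prompt template is an assumption made for illustration.
    """
    prompt = f"State: {t.state}\nNext state: {t.next_state}\nAction:"
    return prompt, t.action


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Average negative log-likelihood over the target tokens."""
    if not token_logprobs:
        return 0.0
    return -sum(token_logprobs) / len(token_logprobs)


def joint_loss(
    rl_loss: float,
    sp_logprobs: Sequence[float],
    inv_logprobs: Sequence[float],
    lambda_sp: float = 0.1,   # assumed weight, not taken from the paper
    lambda_inv: float = 0.1,  # assumed weight, not taken from the paper
) -> float:
    """Primary RL loss plus the two weighted auxiliary losses."""
    return (
        rl_loss
        + lambda_sp * mean_nll(sp_logprobs)
        + lambda_inv * mean_nll(inv_logprobs)
    )
```

In this sketch, both auxiliary targets are built from the same transitions the agent already collects during rollouts, so no extra environment interaction is needed. In an actual training loop, the log-probabilities would come from the policy model scoring the target text, and `rl_loss` from the RL algorithm in use (for instance GRPO).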