Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30+ turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. The cascade begins with premature policy convergence in the early stage, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Agents then enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive, promoting chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within a band around its historical average to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves performance improvements of up to 152% on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn, sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
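To make mechanisms (2) and (3) concrete, here is a minimal PyTorch sketch of an entropy smoothing regularizer with a rolling historical-average band and a phase-based exploration weight. All specifics here are illustrative assumptions rather than the paper's exact formulation: the window size, the band width, the quadratic penalty form, the linear weight schedule, and the names `EntropySmoother` and `phase_weight` are all hypothetical.

```python
# Minimal sketch of the entropy-control mechanisms described above.
# Assumptions (not from the paper): rolling-window average, fixed band,
# quadratic out-of-band penalty, linear exploration-weight decay.

from collections import deque
import torch


class EntropySmoother:
    """Penalize policy entropy that drifts outside a band around its
    recent historical average, damping abrupt entropy fluctuations."""

    def __init__(self, window: int = 50, band: float = 0.1):
        self.history = deque(maxlen=window)  # rolling entropy history
        self.band = band                     # allowed fractional deviation

    def penalty(self, entropy: torch.Tensor) -> torch.Tensor:
        """`entropy` is a scalar tensor (mean policy entropy this step)."""
        if not self.history:
            self.history.append(entropy.item())
            return torch.zeros_like(entropy)
        avg = sum(self.history) / len(self.history)
        self.history.append(entropy.item())
        lo, hi = avg * (1 - self.band), avg * (1 + self.band)
        # Quadratic penalty applies only outside the [lo, hi] band.
        below = torch.clamp(torch.tensor(lo) - entropy, min=0.0)
        above = torch.clamp(entropy - torch.tensor(hi), min=0.0)
        return below ** 2 + above ** 2


def phase_weight(step: int, total_steps: int,
                 w_explore: float = 0.02, w_exploit: float = 0.002) -> float:
    """Phase-based entropy weight: favor exploration early in training
    and decay toward exploitation later (linear schedule as a stand-in)."""
    frac = min(step / max(total_steps, 1), 1.0)
    return w_explore + (w_exploit - w_explore) * frac


# Schematic use inside a policy-gradient update:
#   loss = pg_loss - phase_weight(step, T) * entropy + smoother.penalty(entropy)
```

The key design point the sketch illustrates is that the entropy bonus is not a fixed coefficient: its weight shrinks as training progresses, while the smoothing penalty keeps entropy from swinging far from its own recent history in either direction.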