Entropy-Regularized Token-Level Policy Optimization for Language Agent Reinforcement

Large Language Models (LLMs) have shown promise as intelligent agents in interactive decision-making tasks. Traditional approaches often depend on meticulously designed prompts, high-quality examples, or additional reward models for in-context learning, supervised fine-tuning, or reinforcement learning from human feedback (RLHF). Reinforcement learning (RL) offers a dynamic alternative that lets LLMs overcome these dependencies by interacting directly with task-specific environments. However, it faces two significant hurdles: 1) instability stemming from the need to explore an exponentially large action space; 2) difficulty in assigning token-level credit from action-level reward signals, which creates a conflict between maximizing rewards and accurately modeling corpus data. In response to these challenges, we introduce Entropy-Regularized Token-level Policy Optimization (ETPO), an entropy-augmented RL method tailored for optimizing LLMs at the token level. At the heart of ETPO is our novel per-token soft Bellman update, designed to harmonize the RL process with the principles of language modeling. This method decomposes the Q-function update from a coarse action-level view into a finer token-level view, backed by a theoretical proof of optimization consistency. Crucially, this decomposition yields linear time complexity in action exploration. We assess the effectiveness of ETPO in a simulated environment that models data science code generation as a series of multi-step interactive tasks; the results underline ETPO's potential as a robust method for refining the interactive decision-making capabilities of language agents. For more detailed preliminary work describing our motivation for token-level decomposition and its application to PPO methods, please refer to arXiv:2405.15821.
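To make the token-level decomposition more concrete, the following is a minimal sketch of what a per-token soft Bellman backup could look like. It is not the update defined in the paper, which the abstract does not spell out. It assumes that intermediate tokens of an action receive no reward and no discount, that the final token receives the action-level reward plus the discounted soft value of the first token of the next action, and that regularization is a plain entropy bonus with temperature `beta`. The names `soft_value`, `per_token_targets`, and `exploration_cost` are hypothetical.

```python
import math


def soft_value(q_values, beta=1.0):
    """Entropy-regularized (soft) value: beta * log(sum(exp(q / beta))).

    Computed with the max subtracted first for numerical stability.
    """
    m = max(q_values)
    return m + beta * math.log(sum(math.exp((q - m) / beta) for q in q_values))


def per_token_targets(next_token_qs, reward, next_action_first_qs,
                      gamma=0.99, beta=1.0, done=False):
    """Backup targets for every token position of one action of length L.

    next_token_qs: list of L-1 lists; entry j holds the Q-values over the
        vocabulary at token position j+1, given the tokens up to position j.
    next_action_first_qs: Q-values over the vocabulary for the first token
        of the next action (ignored when done is True).
    Assumption: only the last token sees the reward and the discount.
    """
    # Intermediate tokens bootstrap from the soft value of the next token.
    targets = [soft_value(qs, beta) for qs in next_token_qs]
    # The last token closes the action: reward plus discounted soft value.
    bootstrap = 0.0 if done else gamma * soft_value(next_action_first_qs, beta)
    targets.append(reward + bootstrap)
    return targets


def exploration_cost(vocab_size, action_len):
    """Candidates to score when exploring at the action level vs. token level."""
    return {
        "action_level": vocab_size ** action_len,  # exponential in length
        "token_level": vocab_size * action_len,    # linear in length
    }
```

Under these assumptions, an action of `action_len` tokens is scored one vocabulary-sized step at a time, so the number of candidates considered grows linearly with action length rather than exponentially, which is the intuition behind the linear-time claim above.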
