The outstanding capabilities of large language models (LLMs) make them a crucial component of various autonomous agent systems. While traditional methods depend on the inherent knowledge of LLMs without fine-tuning, more recent approaches have shifted toward reinforcement learning to further enhance agents' ability to solve complex interactive tasks involving environments and tools. However, prior approaches are constrained by the sparse-reward issue: existing datasets provide only a final scalar reward for each multi-step reasoning chain, which can make policy learning ineffective and inefficient. In this paper, we introduce StepAgent, which utilizes step-wise rewards to optimize the agent's reinforcement learning process. Inheriting the spirit of novice-to-expert theory, we first compare the actions of the expert and the agent to automatically generate intermediate rewards for fine-grained optimization. Additionally, we propose implicit-reward and inverse reinforcement learning techniques to facilitate agent reflection and policy adjustment. Further theoretical analysis demonstrates that the action distribution of the agent can converge toward the expert action distribution over multiple training cycles. Experimental results across various datasets indicate that StepAgent outperforms existing baseline methods.
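To make the step-wise reward idea concrete, the sketch below shows one common way to derive an implicit per-step reward from expert comparison: a discriminator is trained to distinguish expert (state, action) steps from agent steps, and its score is used as an intermediate reward, in the spirit of adversarial inverse reinforcement learning. This is a minimal illustration only; all names (StepDiscriminator, step_rewards, the embedding dimensions) are hypothetical assumptions, not StepAgent's actual implementation.

```python
import torch
import torch.nn as nn

class StepDiscriminator(nn.Module):
    """Scores a single (state, action) step; a higher score means the step
    looks more expert-like. Hypothetical stand-in for the learned reward."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        # Concatenate state and action embeddings and score the step.
        return self.net(torch.cat([state, action], dim=-1))

def step_rewards(disc: StepDiscriminator,
                 states: torch.Tensor,
                 agent_actions: torch.Tensor) -> torch.Tensor:
    """Implicit step-wise reward: log D(s, a), so each intermediate step of a
    reasoning chain gets its own signal instead of a single final scalar."""
    with torch.no_grad():
        logits = disc(states, agent_actions).squeeze(-1)
        return torch.log(torch.sigmoid(logits) + 1e-8)
```

Under this assumed setup, the discriminator would be updated with expert steps as positives and agent steps as negatives, while the agent policy is optimized against the resulting dense per-step rewards rather than a single trajectory-level score.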