Large language model (LLM)-based agents are increasingly trained with reinforcement learning (RL) to improve their ability to interact with external environments through tool use, particularly in search-based settings that require multi-turn reasoning and knowledge acquisition. However, existing approaches typically rely on outcome-based rewards that are provided only once the final answer is generated. This reward sparsity is particularly problematic in multi-turn settings, where long trajectories exacerbate three critical issues: (i) advantage collapse, where all rollouts receive identical rewards and therefore provide no useful learning signal; (ii) lack of fine-grained credit assignment, where the correctness of intermediate turns is obscured, especially in long-horizon tasks; and (iii) poor sample efficiency, where each rollout yields only a single outcome signal, leading to low data utilization. In this paper, we propose Information Gain-based Policy Optimization (IGPO), a simple yet effective RL framework that provides dense, intrinsic supervision for multi-turn agent training. IGPO models each interaction turn as an incremental process of acquiring information about the ground truth, and defines the turn-level reward as the marginal increase in the policy's probability of producing the correct answer. Unlike prior process-level reward approaches that depend on external reward models or costly Monte Carlo estimation, IGPO derives intrinsic rewards directly from the model's own belief updates. These intrinsic turn-level rewards are combined with outcome-level supervision to form dense reward signals. Extensive experiments on both in-domain and out-of-domain benchmarks demonstrate that IGPO consistently outperforms strong baselines in multi-turn scenarios, achieving higher accuracy and improved data efficiency. Our code is available at https://github.com/GuoqingWang1/IGPO.
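The turn-level reward described in the abstract can be illustrated with a minimal sketch. It is not taken from the released code: the function name `turn_level_rewards`, the representation of the policy's belief as a plain list of probabilities, and the choice to add the outcome reward to the final turn are all assumptions made for illustration only.

```python
from typing import List


def turn_level_rewards(answer_probs: List[float], outcome_reward: float) -> List[float]:
    """Hypothetical sketch of information-gain rewards for one rollout.

    answer_probs[0] is the policy's probability of producing the correct
    answer before any interaction; answer_probs[t] is that probability
    after turn t. Each turn is rewarded with the marginal increase.
    """
    if len(answer_probs) < 2:
        raise ValueError("need the initial probability plus at least one turn")

    # Marginal increase in the probability of the correct answer per turn.
    rewards = [after - before for before, after in zip(answer_probs, answer_probs[1:])]

    # Assumption: outcome-level supervision is simply added to the last turn.
    rewards[-1] += outcome_reward
    return rewards


if __name__ == "__main__":
    # Illustrative numbers only; turn 2 lowers the probability, so its reward is negative.
    probs = [0.10, 0.30, 0.25, 0.60]
    print(turn_level_rewards(probs, outcome_reward=1.0))
```

Under these assumptions, every turn receives its own signal, so two rollouts with the same final outcome can still be told apart by their intermediate turns. The information-gain terms also telescope: without the outcome reward, they sum to the difference between the final and initial probability of the correct answer.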