Reinforcement learning has emerged as an effective paradigm for training large language models to perform search-augmented reasoning. However, existing approaches rely on trajectory-level rewards that cannot distinguish precise search queries from vague or redundant ones within a rollout group, and that collapse to a near-zero gradient signal whenever every sampled trajectory fails. In this paper, we propose IG-Search, a reinforcement learning framework that introduces a step-level reward based on Information Gain (IG). For each search step, IG measures how much the retrieved documents improve the model's confidence in the gold answer relative to a counterfactual baseline of random documents, thereby reflecting the effectiveness of the underlying search query. This signal is fed back to the corresponding search-query tokens via per-token advantage modulation in GRPO, enabling fine-grained, step-level credit assignment within a rollout. Unlike prior step-level methods that require either externally annotated intermediate supervision or shared environment states across trajectories, IG-Search derives its signals from the policy's own generation probabilities, requiring no intermediate annotations beyond standard question-answer pairs. Experiments on seven single-hop and multi-hop QA benchmarks demonstrate that IG-Search achieves an average Exact Match (EM) score of 0.430 with Qwen2.5-3B, outperforming the strongest trajectory-level baseline (MR-Search) by 1.6 points and the step-level method GiGPO by 0.9 points on average, with particularly pronounced gains on multi-hop reasoning tasks. Despite introducing a dense step-level signal, IG-Search adds only ~6.4% to per-step training wall-clock time over the trajectory-level baseline and leaves inference latency unchanged, while still providing a meaningful gradient signal even when every sampled trajectory answers incorrectly.
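A minimal sketch of the step-level Information Gain score and the per-token advantage modulation described above, assuming a HuggingFace-style causal LM policy. The prompt template, the `answer_logprob` helper, the additive modulation form, and the `alpha` weight are illustrative assumptions for exposition, not the paper's exact implementation.

```python
import torch

def information_gain(model, tokenizer, question, gold_answer,
                     retrieved_docs, random_docs):
    """Score one search step: how much do its retrieved documents raise the
    policy's confidence in the gold answer, relative to a counterfactual
    baseline of randomly sampled documents? (Hypothetical sketch.)"""

    def answer_logprob(docs):
        # Condition the policy on the question plus documents, then score
        # the gold-answer tokens under the policy's own probabilities.
        prompt = question + "\n" + "\n".join(docs) + "\nAnswer: "
        prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
        answer_ids = tokenizer(gold_answer, return_tensors="pt",
                               add_special_tokens=False).input_ids
        input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
        with torch.no_grad():
            logits = model(input_ids).logits
        # Logits at position t predict the token at position t + 1, so the
        # slice below covers exactly the gold-answer tokens.
        answer_logits = logits[0, prompt_ids.shape[1] - 1 : -1]
        log_probs = torch.log_softmax(answer_logits, dim=-1)
        token_lp = log_probs.gather(1, answer_ids[0].unsqueeze(1))
        return token_lp.sum().item()

    # IG = confidence under the query's documents minus the random baseline.
    return answer_logprob(retrieved_docs) - answer_logprob(random_docs)

def modulate_advantages(token_advantages, query_token_mask, ig_score, alpha=1.0):
    """Hypothetical per-token modulation: shift the GRPO advantage on the
    search-query tokens of a step by that step's IG score; `alpha` is an
    assumed mixing weight, and `query_token_mask` is 1 on query tokens."""
    return token_advantages + alpha * ig_score * query_token_mask
```

Because both likelihoods come from the policy itself, this signal needs only the gold answer from a standard question-answer pair, and it stays non-zero even when every sampled trajectory in the group answers incorrectly.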