Optimizing Agentic Reasoning with Retrieval via Synthetic Semantic Information Gain Reward

Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging because dense, principled reward signals are lacking. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states and establish guarantees including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions, using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress and enables efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving up to 5.4% average accuracy improvement. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval. The code is available at https://github.com/dl-m9/InfoReasoner
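
To make the idea of an output-based estimator concrete, here is a minimal sketch, not the authors' implementation. It assumes that answers sampled from the model are grouped into meaning clusters by a two-way entailment check, that uncertainty is the entropy over cluster frequencies, and that the gain is the drop in that entropy after retrieval. The names `cluster_answers`, `semantic_entropy`, `information_gain`, and `toy_entails` are hypothetical, and `toy_entails` is a string-equality stand-in for a real entailment model. The naive entropy difference used here can be negative, so it does not reproduce the non-negativity guarantee stated in the abstract.

```python
import math
from typing import Callable, List

Entails = Callable[[str, str], bool]


def cluster_answers(answers: List[str], entails: Entails) -> List[List[str]]:
    """Greedily group answers; two answers share a cluster when each entails the other."""
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]  # compare against the cluster's first member only
            if entails(ans, rep) and entails(rep, ans):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def semantic_entropy(answers: List[str], entails: Entails) -> float:
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    total = len(answers)
    clusters = cluster_answers(answers, entails)
    return -sum((len(c) / total) * math.log(len(c) / total) for c in clusters)


def information_gain(before: List[str], after: List[str], entails: Entails) -> float:
    """Assumed estimator: entropy before retrieval minus entropy after retrieval."""
    return semantic_entropy(before, entails) - semantic_entropy(after, entails)


def toy_entails(a: str, b: str) -> bool:
    """Placeholder for an entailment model: case-insensitive string equality."""
    return a.strip().lower() == b.strip().lower()


# Hypothetical samples for one question, drawn before and after a retrieval step.
before = ["Paris", "paris", "Lyon", "Marseille"]
after = ["Paris", "paris ", "PARIS", "Paris"]
gain = information_gain(before, after, toy_entails)
```

In this toy case the samples collapse from three meaning clusters to one, so the gain is positive. In a training loop, a quantity of this kind could serve as a per-step reward before group-relative normalization in GRPO, under the assumptions stated above.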


