Agentic reasoning enables large reasoning models (LRMs) to dynamically acquire external knowledge, yet optimizing the retrieval process remains challenging because dense, principled reward signals are lacking. In this paper, we introduce InfoReasoner, a unified framework that incentivizes effective information seeking via a synthetic semantic information gain reward. Theoretically, we redefine information gain as uncertainty reduction over the model's belief states and establish guarantees including non-negativity, telescoping additivity, and channel monotonicity. Practically, to enable scalable optimization without manual retrieval annotations, we propose an output-aware intrinsic estimator that computes information gain directly from the model's output distributions, using semantic clustering via bidirectional textual entailment. This intrinsic reward guides the policy to maximize epistemic progress and enables efficient training via Group Relative Policy Optimization (GRPO). Experiments across seven question-answering benchmarks demonstrate that InfoReasoner consistently outperforms strong retrieval-augmented baselines, achieving an average accuracy improvement of up to 5.4%. Our work provides a theoretically grounded and scalable path toward agentic reasoning with retrieval. The code is available at https://github.com/dl-m9/InfoReasoner
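To make the idea of an output-aware estimator concrete, the following is a minimal sketch, not the authors' implementation. It assumes that information gain can be approximated as the drop in entropy over semantic clusters of sampled answers, before versus after a retrieval step. The names `toy_entails`, `semantic_clusters`, `semantic_entropy`, and `information_gain` are hypothetical, and `toy_entails` is a string-matching stand-in for a real textual entailment model.

```python
import math
from typing import Callable, List

EntailFn = Callable[[str, str], bool]


def toy_entails(premise: str, hypothesis: str) -> bool:
    # Hypothetical stand-in for an NLI model: two answers "entail" each
    # other only if they match after trimming and lowercasing.
    return premise.strip().lower() == hypothesis.strip().lower()


def semantic_clusters(answers: List[str], entails: EntailFn = toy_entails) -> List[List[str]]:
    # Greedy clustering: an answer joins the first cluster whose
    # representative it entails AND is entailed by (bidirectional check).
    clusters: List[List[str]] = []
    for ans in answers:
        for cluster in clusters:
            rep = cluster[0]
            if entails(rep, ans) and entails(ans, rep):
                cluster.append(ans)
                break
        else:
            clusters.append([ans])
    return clusters


def semantic_entropy(answers: List[str], entails: EntailFn = toy_entails) -> float:
    # Shannon entropy (in nats) of the empirical distribution over clusters.
    n = len(answers)
    if n == 0:
        return 0.0
    clusters = semantic_clusters(answers, entails)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


def information_gain(before: List[str], after: List[str], entails: EntailFn = toy_entails) -> float:
    # Assumed reward: reduction in semantic entropy caused by retrieval.
    return semantic_entropy(before, entails) - semantic_entropy(after, entails)


# Answers sampled before retrieval disagree; after retrieval they agree.
before = ["Paris", "Lyon", "paris", "Marseille"]
after = ["Paris", "paris", "Paris", "Paris"]
gain = information_gain(before, after)
```

In this toy case the samples collapse from three semantic clusters to one, so the estimated gain is positive. Note that a plain difference of empirical entropies can be negative when retrieval makes the samples more scattered; how the paper's estimator relates to its non-negativity guarantee is not specified in the abstract.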