Reinforcement Learning (RL) has empowered Large Language Models (LLMs) to perform autonomous search for complex question answering. In multi-turn search scenarios, however, this interaction introduces a critical challenge: search results often suffer from high redundancy and low signal-to-noise ratios. Agents consequently fall into "Tunnel Vision," where the forced interpretation of early noisy retrievals leads to irreversible error accumulation. To address these challenges, we propose SIGHT, a framework that enhances search-based reasoning through Self-Evidence Support (SES) and Information-Gain Driven Diverse Branching. SIGHT distills search results into high-fidelity evidence via SES and computes an information-gain score to pinpoint pivotal states where observations maximally reduce uncertainty. This score guides Dynamic Prompting Interventions, including de-duplication, reflection, and adaptive branching, to spawn new branches seeded with SES evidence. Finally, by integrating SES and correctness rewards via Group Relative Policy Optimization (GRPO), SIGHT internalizes robust exploration strategies without external verifiers. Experiments on single-hop and multi-hop QA benchmarks show that SIGHT significantly outperforms existing approaches, particularly in complex reasoning scenarios, while using fewer search steps.
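To make the information-gain signal concrete, the following is a minimal sketch of how a pivotal state might be scored: the gain is the entropy reduction between the agent's answer distribution before and after an observation, and a threshold decides whether to trigger an intervention. The function names, the threshold value, and the use of Shannon entropy over a discrete candidate-answer distribution are illustrative assumptions, not the paper's exact formulation.

```python
import math

def entropy(dist):
    """Shannon entropy (nats) of a discrete probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)

def information_gain(prior, posterior):
    """Entropy reduction from prior to posterior belief over candidate answers.

    A large gain marks a pivotal state: the new observation sharply
    reduced the agent's uncertainty about the answer.
    """
    return entropy(prior) - entropy(posterior)

def should_intervene(prior, posterior, threshold=0.3):
    """Hypothetical trigger: spawn a new branch / prompt intervention
    when the observation's information gain exceeds a threshold."""
    return information_gain(prior, posterior) > threshold

# Before the search step the agent is uncertain over 4 candidate answers;
# after reading the retrieved evidence, most mass shifts to one candidate.
prior = [0.25, 0.25, 0.25, 0.25]
posterior = [0.7, 0.1, 0.1, 0.1]
print(should_intervene(prior, posterior))  # prints True
```

In this toy case the gain is roughly 0.45 nats, so the (assumed) threshold of 0.3 is crossed and the state would be flagged for branching.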