While Reinforcement Learning (RL) has advanced LLM reasoning, applying it to long-context scenarios is hindered by the sparsity of outcome rewards. Sparse outcome rewards fail to penalize ungrounded "lucky guesses," leaving the critical needle-in-a-haystack process of evidence retrieval largely unsupervised. To address this, we propose EAPO (Evidence-Augmented Policy Optimization). We first establish the Evidence-Augmented Reasoning paradigm and use Tree-Structured Evidence Sampling to validate that precise evidence extraction is the decisive bottleneck in long-context reasoning. Guided by this insight, EAPO introduces a specialized RL algorithm in which a reward model computes a Group-Relative Evidence Reward, providing dense process supervision that explicitly improves evidence quality. To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism, which iteratively refines the reward model on outcome-consistent rollouts, sharpening its discriminative capability and ensuring precise process guidance. Comprehensive evaluations across eight benchmarks demonstrate that EAPO significantly outperforms state-of-the-art baselines on long-context reasoning.
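To make the Group-Relative Evidence Reward more concrete, the following is a minimal sketch of how a group-relative signal could be computed from reward-model scores over a sampled group of rollouts. It assumes a GRPO-style normalization (score minus group mean, divided by group standard deviation); the function name and the `evidence_scores` input are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def group_relative_evidence_reward(evidence_scores):
    """Hypothetical sketch: normalize per-rollout evidence scores within one
    prompt's sampled group so each rollout's evidence reward is measured
    relative to its group, yielding a dense process-level training signal.

    evidence_scores: reward-model scores for the evidence extracted in each
    rollout of a single group (one score per rollout).
    """
    scores = np.asarray(evidence_scores, dtype=float)
    mean, std = scores.mean(), scores.std()
    # Guard against a degenerate group where all scores are identical.
    if std < 1e-8:
        return np.zeros_like(scores)
    return (scores - mean) / std

# Example: a group of 4 rollouts whose extracted evidence the reward model scored.
print(group_relative_evidence_reward([0.9, 0.2, 0.5, 0.4]))
```

In this reading, the normalized evidence reward would supplement the sparse outcome reward during policy optimization, rewarding rollouts whose extracted evidence the reward model judges better than its group peers.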