Reinforcement Learning with Verifiable Rewards (RLVR) has significantly advanced the reasoning capabilities of Large Language Models (LLMs) by optimizing them against factual outcomes. However, this paradigm falters in long-context scenarios: its reliance on internal parametric knowledge is ill-suited to tasks that require contextual grounding, the ability to find and reason over externally provided information. We identify a key reason for this failure: a reward based solely on the final answer is too sparse to effectively guide the model toward identifying relevant evidence. We formally prove that the outcome-only reward causes the gradients of the context-grounding process to vanish, rendering learning intractable. To overcome this bottleneck, we introduce LongRLVR, which augments the sparse answer reward with a dense, verifiable context reward. This auxiliary signal directly incentivizes the model to select the correct grounding information, providing a robust learning gradient that resolves the underlying optimization challenge. We validate our method on challenging long-context benchmarks using Qwen and LLaMA models. LongRLVR consistently and significantly outperforms standard RLVR across all models and benchmarks, e.g., boosting a 14B model's score on RULER-QA from 73.17 to 88.90 and on LongBench v2 from 39.8 to 46.5. Our work demonstrates that explicitly rewarding the grounding process is a critical and effective strategy for unlocking the full reasoning potential of LLMs in long-context applications. Our code is available at https://github.com/real-absolute-AI/LongRLVR.
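To make the reward design concrete, the following is a minimal Python sketch of how a sparse answer reward could be combined with a dense, verifiable context reward. This is an illustration under our own assumptions, not the released implementation: the helper names (`answer_reward`, `context_reward`, `cited_ids`, `gold_evidence_ids`) and the mixing weight `lam` are hypothetical.

```python
# Minimal sketch (not the authors' released code) of reward shaping in the
# spirit of LongRLVR: a sparse, verifiable answer reward augmented with a
# dense, verifiable context reward. All names and the weight `lam` are
# illustrative assumptions.

def answer_reward(predicted: str, gold: str) -> float:
    """Sparse outcome reward: 1 only if the final answer matches exactly."""
    return 1.0 if predicted.strip().lower() == gold.strip().lower() else 0.0

def context_reward(cited_ids: set[int], gold_evidence_ids: set[int]) -> float:
    """Dense grounding reward: F1 between the context chunks the model cites
    and the gold evidence chunks, so partially correct grounding still earns
    a learning signal."""
    if not cited_ids or not gold_evidence_ids:
        return 0.0
    tp = len(cited_ids & gold_evidence_ids)
    precision = tp / len(cited_ids)
    recall = tp / len(gold_evidence_ids)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def total_reward(predicted: str, gold: str,
                 cited_ids: set[int], gold_evidence_ids: set[int],
                 lam: float = 0.5) -> float:
    """Combined objective: sparse answer reward plus weighted dense context
    reward. `lam` (a hypothetical hyperparameter) trades off the two terms."""
    return answer_reward(predicted, gold) + lam * context_reward(cited_ids, gold_evidence_ids)

if __name__ == "__main__":
    # A rollout with a wrong final answer but mostly correct grounding still
    # receives a nonzero, gradient-bearing reward, unlike the outcome-only
    # baseline, which would return 0 here.
    print(total_reward("Paris", "Lyon", {3, 7}, {3, 7, 12}))  # 0.0 + 0.5 * 0.8 = 0.4
```

The key design choice the sketch highlights is that the context reward remains verifiable (it is computed against annotated evidence spans rather than a learned judge) while being dense enough to reward partial progress in grounding.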