Reinforcement Learning has emerged as a key driver of LLM reasoning. This capability is equally pivotal in long-context scenarios such as long-dialogue understanding and structured data analysis, where the challenge extends beyond consuming tokens to performing rigorous deduction. While existing efforts focus on data synthesis or architectural changes, recent work points out that relying solely on sparse, outcome-only rewards yields limited gains, as such coarse signals are often insufficient to guide complex long-context reasoning effectively. To address this, we propose LongR, a unified framework that enhances long-context performance by integrating a dynamic "Think-and-Read" mechanism, which interleaves reasoning with document consultation, with a contextual density reward based on relative information gain, which quantifies the utility of relevant documents. Empirically, LongR achieves a 9% gain on LongBench v2 and consistent improvements on RULER and InfiniteBench, demonstrating robust efficiency in navigating extensive contexts. Furthermore, LongR improves performance across diverse RL algorithms (e.g., DAPO, GSPO). Finally, we conduct in-depth analyses of the impact of reasoning-chain length on efficiency and of the model's robustness against distractors.
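The abstract names the contextual density reward without defining it, so the following is only a minimal sketch of one plausible form, assuming relative information gain is measured as the log-likelihood improvement that consulted documents provide on the gold answer, normalized by their token cost. The function and parameter names (`contextual_density_reward`, `logp_with_docs`, etc.) are hypothetical illustrations, not the paper's actual formulation.

```python
def contextual_density_reward(logp_with_docs: float,
                              logp_without_docs: float,
                              num_doc_tokens: int) -> float:
    """Hypothetical contextual-density-style reward.

    logp_with_docs    -- log-likelihood of the gold answer given question + consulted docs
    logp_without_docs -- log-likelihood of the gold answer given the question alone
    num_doc_tokens    -- total token count of the documents the policy consulted
    """
    # Relative information gain (in nats): how much the consulted
    # documents improve the model's belief in the correct answer.
    info_gain = logp_with_docs - logp_without_docs
    # Density: gain per consulted token, so reading irrelevant or
    # redundant documents is penalized relative to concise, useful ones.
    return info_gain / max(num_doc_tokens, 1)
```

Under this reading, normalizing by token cost would encourage the policy to consult only the documents that actually raise answer likelihood, rather than rewarding indiscriminate reading.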