Recent advances in Large Language Models (LLMs) have highlighted the challenge of handling long-context tasks, where models must reason over extensive input contexts to aggregate target information. While Chain-of-Thought (CoT) prompting has shown promise for multi-step reasoning, its effectiveness in long-context scenarios remains underexplored. Through a systematic investigation across diverse tasks, we demonstrate that CoT's benefits generalize to most long-context scenarios and amplify as context length grows. Motivated by this observation, we propose LongRePS, a process-supervised framework that teaches models to generate high-quality reasoning paths for enhanced long-context performance. Our framework incorporates a self-sampling mechanism to bootstrap reasoning paths and a novel quality assessment protocol designed specifically for long-context scenarios. Experimental results on various long-context benchmarks demonstrate the effectiveness of our approach, which achieves significant improvements over outcome-supervision baselines both on in-domain tasks (+13.6/+3.8 points for LLaMA/Qwen on MuSiQue) and in cross-domain generalization (+9.3/+8.1 points on average across diverse QA tasks). Our code, data, and trained models are publicly available to facilitate future research.
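To make the two components concrete, below is a minimal sketch of how a self-sampling and quality-filtering loop of this kind could be wired together. It is not the paper's released implementation: the model interface, the `sample_cot` prompt format, and the `is_grounded` heuristic are all illustrative assumptions.

```python
# Minimal sketch of a LongRePS-style self-sampling + quality-filtering loop.
# All function names and the grounding heuristic are hypothetical placeholders,
# not the paper's actual API.

import re


def sample_cot(model, context: str, question: str) -> tuple[str, str]:
    """Placeholder: query the model for a CoT reasoning path plus final answer.
    Assumes `model` is a callable that maps a prompt string to a completion
    ending in an 'Answer:' line."""
    output = model(f"{context}\n\nQuestion: {question}\nLet's think step by step.")
    path, _, answer = output.rpartition("Answer:")
    return path.strip(), answer.strip()


def is_grounded(path: str, context: str) -> bool:
    """Crude long-context quality check: every span the reasoning path quotes
    must literally occur in the input context."""
    return all(span in context for span in re.findall(r'"([^"]+)"', path))


def build_supervision_set(model, examples, n_samples: int = 8):
    """Bootstrap reasoning paths from the model itself (self-sampling) and
    keep only paths that answer correctly AND stay grounded in the context.
    The surviving paths serve as fine-tuning targets (process supervision)."""
    curated = []
    for ex in examples:  # ex: {"context": ..., "question": ..., "answer": ...}
        for _ in range(n_samples):
            path, answer = sample_cot(model, ex["context"], ex["question"])
            if answer == ex["answer"] and is_grounded(path, ex["context"]):
                curated.append({**ex, "reasoning": path})
    return curated
```

In this reading, the contrast with an outcome-supervision baseline is that the filter inspects the reasoning path itself (here, a simple groundedness check), not just whether the final answer matches.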