LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

Long-context reasoning remains a central challenge for large language models, which often fail to locate and integrate key information amid extensive distracting content. Reinforcement learning with verifiable rewards (RLVR) has shown promise for this task, yet existing methods are limited by low-confusability distractors and by sparse, outcome-only reward signals that cannot supervise intermediate reasoning steps. To address these issues, we introduce \textsc{LongTraceRL}. For data construction, we generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build \emph{tiered distractors}: documents the agent read but did not cite (high confusability) and documents that appeared in search results but were never opened (low confusability). The resulting training contexts are far more challenging than those built by random sampling or one-shot search. For reward design, we propose a \emph{rubric reward} that uses the gold entities along each reasoning chain as fine-grained, entity-level process supervision. This rubric reward is applied only to responses with correct final answers (a positive-only strategy), which distinguishes reasoning quality among correct responses and prevents reward hacking. Experiments on three reasoning LLMs (4B--30B) across five long-context benchmarks demonstrate that \textsc{LongTraceRL} consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. Code, datasets, and models are available at \href{https://github.com/THU-KEG/LongTraceRL}{https://github.com/THU-KEG/LongTraceRL}.
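
The two ideas in the abstract, tiered distractors and the positive-only rubric reward, can be illustrated with a minimal Python sketch. It is not the released implementation: the function names `tier_distractors` and `rubric_reward`, the base reward of 1.0, the `rubric_weight` of 0.5, the additive combination, and the case-insensitive substring matching of gold entities are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def tier_distractors(
    cited: Iterable[str], opened: Iterable[str], retrieved: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split documents from one search-agent trajectory into two tiers.

    high: opened (read) by the agent but not cited   -> high confusability
    low:  appeared in search results, never opened   -> low confusability
    Document IDs are hypothetical string identifiers.
    """
    cited_set, opened_set = set(cited), set(opened)
    # dict.fromkeys removes duplicates while preserving first-seen order.
    high = [d for d in dict.fromkeys(opened) if d not in cited_set]
    low = [
        d
        for d in dict.fromkeys(retrieved)
        if d not in opened_set and d not in cited_set
    ]
    return high, low


def rubric_reward(
    response: str,
    answer_correct: bool,
    gold_entities: Iterable[str],
    rubric_weight: float = 0.5,  # assumed value, not from the paper
) -> float:
    """Outcome reward plus an entity-level rubric bonus (positive-only).

    The bonus is added only when the final answer is correct, so an
    incorrect response cannot gain reward by merely listing entities.
    """
    if not answer_correct:
        return 0.0
    entities = [e for e in gold_entities if e.strip()]
    if not entities:
        return 1.0
    text = response.casefold()
    # Assumed matching rule: case-insensitive substring containment.
    hits = sum(1 for e in entities if e.casefold() in text)
    return 1.0 + rubric_weight * hits / len(entities)
```

Under these assumptions, a correct response that mentions every gold entity on the reasoning chain receives 1.5, a correct response that mentions none receives 1.0, and any incorrect response receives 0.0 regardless of which entities it names.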
