This paper introduces KLong, an open-source LLM agent trained to solve extremely long-horizon tasks. The core idea is to first cold-start the model via trajectory-splitting SFT and then scale it via progressive RL training. Specifically, we first activate the basic agentic abilities of a base model with a comprehensive SFT recipe. We then introduce Research-Factory, an automated pipeline that generates high-quality training data by collecting research papers and constructing evaluation rubrics. Using this pipeline, we build thousands of long-horizon trajectories distilled from Claude 4.5 Sonnet (Thinking). To train on these extremely long trajectories, we propose trajectory-splitting SFT, which preserves early context, progressively truncates later context, and maintains overlap between sub-trajectories. To further improve long-horizon task-solving capability, we propose progressive RL, which schedules training into multiple stages with progressively extended timeouts. Experiments demonstrate the superiority and generalization ability of KLong, as shown in Figure 1. Notably, KLong (106B) surpasses Kimi K2 Thinking (1T) by 11.28% on PaperBench, and the improvement generalizes to other coding benchmarks such as SWE-bench Verified and MLE-bench.
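As an illustration of the trajectory-splitting idea described above, the following Python sketch shows one possible reading of it. It is not the KLong implementation: the function name `split_trajectory`, the parameters `head`, `window`, and `overlap`, and the choice to measure context in steps rather than tokens are all assumptions made for this example. Each sub-trajectory keeps the first `head` steps (the preserved early context), drops the later steps that precede its window (the truncated later context), and shares `overlap` steps with the previous window.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_trajectory(
    steps: Sequence[T], head: int, window: int, overlap: int
) -> List[List[T]]:
    """Split one long trajectory into overlapping sub-trajectories.

    Hypothetical sketch: every sub-trajectory is the first `head` steps
    followed by a sliding window of up to `window` later steps, and
    consecutive windows share `overlap` steps.
    """
    if head < 0 or window <= 0 or not 0 <= overlap < window:
        raise ValueError("need head >= 0, window > 0, 0 <= overlap < window")

    prefix = list(steps[:head])   # early context, kept in every chunk
    tail = list(steps[head:])     # later context, windowed below

    # Short trajectories fit in a single chunk and need no splitting.
    if len(tail) <= window:
        return [prefix + tail]

    stride = window - overlap     # how far each window advances
    chunks: List[List[T]] = []
    start = 0
    while True:
        end = start + window
        # Steps in tail[:start] are truncated from this chunk.
        chunks.append(prefix + tail[start:end])
        if end >= len(tail):
            break
        start += stride
    return chunks


if __name__ == "__main__":
    for chunk in split_trajectory(list(range(10)), head=2, window=4, overlap=1):
        print(chunk)
    # [0, 1, 2, 3, 4, 5]
    # [0, 1, 5, 6, 7, 8]
    # [0, 1, 8, 9]
```

In this toy run, steps 0 and 1 appear in every chunk, while steps 5 and 8 are the overlap shared by neighbouring chunks. A token-budgeted variant would replace the step counts with token counts, which this sketch does not attempt.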