Deep research agents have shown remarkable potential on long-horizon tasks. However, state-of-the-art performance typically relies on online reinforcement learning (RL), which is financially expensive due to extensive API calls. Offline training offers a more efficient alternative, but its progress is hindered by the scarcity of high-quality research trajectories. In this paper, we demonstrate that expensive online RL is not all you need to build powerful research agents. To bridge this gap, we introduce a fully open-source suite designed for effective offline training. Our core contributions are DeepForge, a ready-to-use task-synthesis framework that generates large-scale research queries without heavy preprocessing, and a curated collection of 66k QA pairs, 33k SFT trajectories, and 21k DPO pairs. Leveraging these resources, we train OffSeeker (8B), a model developed entirely offline. Extensive evaluations across six benchmarks show that OffSeeker not only leads among similarly sized agents but also remains competitive with 30B-parameter systems trained via heavy online RL.