Recently, the frontier of Large Language Model (LLM) capabilities has shifted from single-turn code generation to agentic software engineering, a paradigm in which models autonomously navigate, edit, and test complex repositories. While post-training methods have become the de facto approach for code agents, **agentic mid-training**, i.e., mid-training (MT) on large-scale data that mirrors authentic agentic workflows, remains critically underexplored due to its substantial resource requirements, despite offering a more scalable path to instilling foundational agentic behaviors than relying solely on expensive reinforcement learning. A central challenge in realizing effective agentic mid-training is the distribution mismatch between static training data and the dynamic, feedback-rich environment of real development. To address this, we present a systematic study of agentic mid-training, establishing both the data synthesis principles and the training methodology for effective agent development at scale. Central to our approach is **agent-native data**: supervision comprising two complementary types of trajectories: **contextually-native trajectories**, which preserve the complete information flow an agent experiences, offering broad coverage and diversity; and **environmentally-native trajectories**, collected from executable repositories where observations stem from actual tool invocations and test executions, providing depth and interaction authenticity. We verify the model's agentic capabilities on `SWE-Bench Verified`. Under two post-training settings with an aligned base model and agentic scaffold, we demonstrate superiority over `Kimi-Dev`, the previous open software engineering mid-training recipe, while using fewer than half its mid-training tokens (73.1B). Beyond this relative advantage, our best-performing 32B and 72B models achieve **56.1%** and **58.5%** resolution rates, respectively, which are ...