Repository-level code editing requires models to understand complex dependencies and execute precise multi-file modifications across large codebases. While recent gains on SWE-bench rely heavily on complex agent scaffolding, it remains unclear how much of this capability can be internalised through high-quality training signals. To address this, we propose Clean Pull Request (Clean-PR), a mid-training paradigm that leverages real-world GitHub pull requests as a training signal for repository-level editing. We introduce a scalable pipeline that converts noisy pull request diffs into Search/Replace edit blocks through reconstruction and validation, yielding the largest publicly available corpus of its kind: 2 million pull requests spanning 12 programming languages. Using this training signal, we perform a mid-training stage followed by agentless-aligned supervised fine-tuning with error-driven data augmentation. On SWE-bench, our model significantly outperforms its instruction-tuned baseline, achieving absolute improvements of 13.6% on SWE-bench Lite and 12.3% on SWE-bench Verified. These results demonstrate that repository-level code understanding and editing capabilities can be effectively internalised into model weights under a simplified, agentless protocol, without relying on heavy inference-time scaffolding.
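The abstract does not specify the exact Search/Replace edit-block semantics; as a minimal sketch, an edit of this kind can be modelled as a pair of text spans applied to a file, where the search span must match the source exactly and unambiguously. The function name and the exactly-once matching rule are illustrative assumptions, not the paper's definition:

```python
def apply_search_replace(source: str, search: str, replace: str) -> str:
    """Apply one Search/Replace edit block to a file's contents.

    Illustrative sketch: the search text must occur exactly once in the
    source, otherwise the edit is ambiguous (or stale) and is rejected.
    """
    occurrences = source.count(search)
    if occurrences != 1:
        raise ValueError(
            f"search block matched {occurrences} times; expected exactly 1"
        )
    return source.replace(search, replace)


# Example: fix a one-line bug by swapping the matched span.
original = "def add(a, b):\n    return a - b\n"
patched = apply_search_replace(
    original,
    search="    return a - b",
    replace="    return a + b",
)
print(patched)
```

Validating that the search span reproduces the pre-edit file text is what lets a pipeline check reconstructed edit blocks against the actual pull request diff.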