When assessing the quality of coding agents, the predominant benchmarks, such as SWE-Bench, focus on solving single GitHub issues. In real use, however, these agents tackle more varied and complex tasks that require additional skills such as exploring codebases, testing software, and designing architecture. In this paper, we first characterize transferable skills that are shared across diverse tasks by decomposing trajectories into fine-grained components, and derive a set of principles for designing auxiliary training tasks that teach language models these skills. Guided by these principles, we propose a training environment, Hybrid-Gym, consisting of a set of scalable synthetic tasks, such as function localization and dependency search. Experiments show that agents trained on our synthetic tasks generalize effectively to diverse real-world tasks not present in training, improving a base model by 25.4% absolute on SWE-Bench Verified, 7.9% on SWT-Bench Verified, and 5.1% on Commit-0 Lite. Hybrid-Gym also complements datasets built for the downstream tasks (e.g., improving SWE-Play by 4.9% on SWT-Bench Verified). Code is available at: https://github.com/yiqingxyq/Hybrid-Gym.