Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving existing functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from the release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks that require multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO, compared with 72.80% for GPT-5.2 on SWE-Bench Verified, indicating that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric that captures partial progress on these complex, long-horizon tasks.