SWE-Marathon: Can Agents Autonomously Complete Ultra-Long-Horizon Software Work?

Rishi Desai, Jesse Hu, Joan Cabezas, Neel Harsola, Pratyush Shukla, Roey Ben Chaim, Adnan El Assadi, Omkaar Mukund Kamath, Fenil Faldu, Prannay Hebbar, Jiankai Sun, Yiyuan Li, Pramod Srinivasan, Ishan Gupta, Christopher Settles, Daniel Wang, Derek Chen, Pranav Raja, Albert Liu, Marek Šuppa, Nevasini Sasikumar, Luyang Kong, Erik Quintanilla, Xiangyi Li, Ivan Bercovich, Steven Dillmann

AI agents are increasingly expected to complete long-horizon workflows that require sustained progress over hours, millions of tokens, and complex environments. Yet current agent benchmarks largely evaluate short-form tasks, such as single pull requests, small tickets, or 5-10 minute exercises, limiting our ability to measure agents' capabilities in planning, long-context understanding, and memory use. We introduce SWE-Marathon, a benchmark of 20 long-horizon tasks spanning software engineering and adjacent technical domains. Each task consists of a unique executable environment, a human-written reference solution, and a multi-layer verification suite. Logged agent attempts average 27.2M total tokens, making SWE-Marathon substantially longer-horizon than existing SWE and command-line agent benchmarks. Current frontier coding agents solve fewer than 30% of tasks. Failures often arise from poor self-verification, self-reported infeasibility, and premature termination. We also observe reward-hacking behavior in 13.8% of rollouts, where agents attempt to exploit the environment or verifier to bypass the intended workflow. SWE-Marathon includes adversarial review of test suites and execution environments, as well as multi-layer checks designed to prevent shortcut solutions. We release SWE-Marathon, evaluation code, and agent trajectories at https://swe-marathon.org/.
