We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations that focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals such as logs and dashboards, as well as unstructured context. We evaluated eleven frontier models. Claude Opus 4.6 leads the APEX-SWE leaderboard with 40.5% Pass@1, followed by Claude Opus 4.5 at 38.7%. Our analysis shows that strong performance is primarily driven by epistemic discipline, defined as the capacity to distinguish between assumptions and verified facts, often combined with systematic verification prior to acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).
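For readers unfamiliar with the metric, the following is a minimal sketch of how a Pass@1 score could be computed from per-task outcomes. It is an illustration rather than the APEX-SWE harness: the function name `pass_at_1`, the task identifiers, and the layout of `results` are hypothetical, and the number of runs per task is an assumption, since the abstract does not state it. The sketch takes Pass@1 as the fraction of successful runs for each task, averaged over all tasks.

```python
from statistics import mean


def pass_at_1(results: dict[str, list[bool]]) -> float:
    """Return Pass@1 as the mean per-task success rate.

    `results` maps a task identifier to the pass/fail outcomes of its
    independent runs (hypothetical layout, not the harness's format).
    """
    if not results:
        raise ValueError("results must contain at least one task")

    per_task_rates = []
    for task_id, runs in results.items():
        if not runs:
            raise ValueError(f"task {task_id!r} has no recorded runs")
        # True counts as 1 and False as 0, so sum() gives the pass count.
        per_task_rates.append(sum(runs) / len(runs))

    return mean(per_task_rates)


# Made-up outcomes for three illustrative tasks, three runs each.
example_results = {
    "integration-001": [True, False, True],
    "observability-001": [False, False, False],
    "observability-002": [True, True, True],
}

print(f"Pass@1 = {pass_at_1(example_results):.1%}")
```

With a single run per task, this reduces to the share of tasks solved on the first attempt.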