We introduce the AI Productivity Index for Software Engineering (APEX-SWE), a benchmark for assessing whether frontier AI models can execute economically valuable software engineering work. Unlike existing evaluations, which focus on narrow, well-defined tasks, APEX-SWE assesses two novel task types that reflect real-world software engineering practice: (1) Integration tasks (n=100), which require constructing end-to-end systems across heterogeneous cloud primitives, business applications, and infrastructure-as-code services, and (2) Observability tasks (n=100), which require debugging production failures using telemetry signals, such as logs and dashboards, together with unstructured context. We evaluate eight frontier models on APEX-SWE. Gemini 3 Pro (Thinking = High) performs best, with a Pass@1 score of 25\%. Our analysis shows that strong performance is driven primarily by epistemic reasoning, defined as the ability to distinguish assumptions from verified facts, combined with the agency to resolve uncertainty before acting. We open-source the APEX-SWE evaluation harness and a dev set (n=50).