SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents

Zhirui Zhang, Hongbo Zhang, Haoxiang Fei, Zhiyuan Bao, Yubin Chen, Zhengyu Lei, Ziyue Liu, Yixuan Sun, Mingkun Xiao, Zihang Ye, Yu Zhang, Hongcheng Zhu, Yuxiang Wen, Heung-Yeung Shum

From arXiv, 20 pages, 3 figures

Although large language models (LLMs) have demonstrated impressive coding capabilities, their ability to autonomously build production-scale software from explicit specifications remains an open question. We introduce SWE-AGI, an open-source benchmark for evaluating end-to-end, specification-driven construction of software systems written in MoonBit. SWE-AGI tasks require LLM-based agents to implement parsers, interpreters, binary decoders, and SAT solvers strictly from authoritative standards and RFCs under a fixed API scaffold. Each task involves implementing 1,000-10,000 lines of core logic, corresponding to weeks or months of engineering effort for an experienced human developer. By leveraging the nascent MoonBit ecosystem, SWE-AGI minimizes data leakage, forcing agents to rely on long-horizon architectural reasoning rather than code retrieval. Across frontier models, gpt-5.3-codex achieves the best overall performance (solving 19/22 tasks, 86.4%), outperforming claude-opus-4.6 (15/22, 68.2%), and kimi-2.5 exhibits the strongest performance among open-source models. Performance degrades sharply with increasing task difficulty, particularly on hard, specification-intensive systems. Behavioral analysis further reveals that as codebases scale, code reading, rather than writing, becomes the dominant bottleneck in AI-assisted development. Overall, while specification-driven autonomous software engineering is increasingly viable, substantial challenges remain before it can reliably support production-scale development.
