The evolution of Large Language Models (LLMs) into autonomous agents has expanded the scope of AI coding from localized code generation to complex, repository-level, execution-driven problem solving. However, current benchmarks predominantly evaluate code logic in static contexts, neglecting the dynamic, full-process requirements of real-world engineering, particularly in backend development, which demands rigorous environment configuration and service deployment. To address this gap, we introduce ABC-Bench, a benchmark explicitly designed to evaluate agentic backend coding within a realistic, executable workflow. Using a scalable automated pipeline, we curated 224 practical tasks spanning 8 languages and 19 frameworks from open-source repositories. Distinct from previous evaluations, ABC-Bench requires agents to manage the entire development lifecycle, from repository exploration to instantiating containerized services and passing external end-to-end API tests. Our extensive evaluation reveals that even state-of-the-art models struggle to deliver reliable performance on these holistic tasks, highlighting a substantial disparity between current model capabilities and the demands of practical backend engineering. Our code is available at https://github.com/OpenMOSS/ABC-Bench.