Recent coding agents can generate complete codebases from simple prompts, yet existing evaluations focus on issue-level bug fixing and lag behind end-to-end development. We introduce ProjDevBench, an end-to-end benchmark that provides project requirements to coding agents and evaluates the resulting repositories. Combining Online Judge (OJ) testing with LLM-assisted code review, the benchmark evaluates agents on (1) system architecture design, (2) functional correctness, and (3) iterative solution refinement. We curate 20 programming problems across 8 categories, covering both concept-oriented tasks and real-world application scenarios, and evaluate six coding agents built on different LLM backends. Our evaluation reports an overall acceptance rate of 27.38%: agents handle basic functionality and data structures but struggle with complex system design, time complexity optimization, and resource management. Our benchmark is available at https://github.com/zsworld6/projdevbench.