Large language models and agents have achieved remarkable progress in code generation. However, existing benchmarks focus on isolated function/class-level generation (e.g., ClassEval) or modifications to existing codebases (e.g., SWE-Bench), neglecting complete microservice repository generation that reflects real-world 0-to-1 development workflows. To bridge this gap, we introduce RepoGenesis, the first multilingual benchmark for repository-level end-to-end web microservice generation, comprising 106 repositories (60 Python, 46 Java) across 18 domains and 11 frameworks, with 1,258 API endpoints and 2,335 test cases verified through a "review-rebuttal" quality assurance process. We evaluate open-source agents (e.g., DeepCode) and commercial IDEs (e.g., Cursor) using Pass@1, API Coverage (AC), and Deployment Success Rate (DSR). Results reveal that despite high AC (up to 73.91%) and DSR (up to 100%), the best-performing system achieves only 23.67% Pass@1 on Python and 21.45% on Java, exposing deficiencies in architectural coherence, dependency management, and cross-file consistency. Notably, GenesisAgent-8B, fine-tuned on RepoGenesis (train), achieves performance comparable to GPT-5 mini, demonstrating the quality of RepoGenesis for advancing microservice generation. We release our benchmark at https://github.com/microsoft/DKI_LLM/tree/main/RepoGenesis.
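The three metrics named above can be illustrated with a minimal sketch. This is a hypothetical formulation for intuition only: the field names and aggregation choices (e.g., micro-averaging test cases across repositories) are assumptions, not the paper's exact evaluation protocol.

```python
# Illustrative (assumed) definitions of the abstract's three metrics:
# Pass@1, API Coverage (AC), and Deployment Success Rate (DSR).

def pass_at_1(results):
    """results: list of (passed_tests, total_tests) per repository.
    Micro-averaged fraction of test cases passed on the first attempt."""
    passed = sum(p for p, _ in results)
    total = sum(t for _, t in results)
    return passed / total if total else 0.0

def api_coverage(implemented, specified):
    """Fraction of the specified API endpoints that the generated
    repository actually implements."""
    return len(set(implemented) & set(specified)) / len(specified)

def deployment_success_rate(deploy_flags):
    """deploy_flags: one boolean per generated repository, True if the
    repository built and deployed successfully."""
    return sum(deploy_flags) / len(deploy_flags)
```

Under these assumed definitions, a system can score high on AC and DSR (endpoints exist and the service deploys) while still failing most functional test cases, which matches the gap between the reported AC/DSR and Pass@1 numbers.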