Rapid advances in large language models (LLMs) have shown significant potential for End-to-End Software Development (E2ESD). However, existing E2ESD benchmarks are limited by coarse-grained requirement specifications and unreliable evaluation protocols, hindering a true understanding of current framework capabilities. To address these limitations, we present E2EDev, a novel benchmark grounded in the principles of Behavior-Driven Development (BDD), which evaluates E2ESD frameworks by assessing whether the generated software meets user needs through mimicking real user interactions (Figure 1). E2EDev comprises (i) a fine-grained set of user requirements, (ii) multiple BDD test scenarios with corresponding Python step implementations for each requirement, and (iii) a fully automated testing pipeline built on the Behave framework. To ensure quality while reducing annotation effort, E2EDev leverages our proposed Human-in-the-Loop Multi-Agent Annotation framework (HITL-MAA). Evaluating various E2ESD frameworks and LLM backbones with E2EDev, our analysis reveals a persistent struggle to solve these tasks effectively, underscoring the critical need for more effective and cost-efficient E2ESD solutions. Our codebase and benchmark are publicly available at https://github.com/SCUNLP/E2EDev.
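To illustrate the BDD-style evaluation the abstract describes, the following is a minimal, self-contained sketch: a Gherkin-like scenario whose steps are matched to Python step functions that drive and check the software under test. The scenario text, step names, and the tiny step registry are all hypothetical, written with the standard library only so the example runs standalone; the actual E2EDev pipeline uses the Behave framework, where steps are registered with `@given`/`@when`/`@then` decorators instead.

```python
# Sketch of BDD testing: match natural-language scenario steps to Python
# step implementations. Hypothetical example, not E2EDev's actual code;
# E2EDev builds on Behave, mimicked here by a minimal stdlib-only registry.
import re
from types import SimpleNamespace

STEPS = []  # (compiled pattern, step function)

def step(pattern):
    """Register a step implementation under a regex pattern."""
    def register(fn):
        STEPS.append((re.compile(pattern + r"$"), fn))
        return fn
    return register

@step(r"an empty to-do list")
def empty_list(ctx):
    ctx.todos = []

@step(r'the user adds "(.+)"')
def add_item(ctx, item):
    ctx.todos.append(item)

@step(r"the list contains (\d+) item")
def check_count(ctx, count):
    assert len(ctx.todos) == int(count)

def run_scenario(lines):
    """Execute each scenario line against the first matching step."""
    ctx = SimpleNamespace()  # shared state, like Behave's `context`
    for line in lines:
        text = line.split(" ", 1)[1]  # strip the Given/When/Then keyword
        for pattern, fn in STEPS:
            match = pattern.match(text)
            if match:
                fn(ctx, *match.groups())
                break
        else:
            raise AssertionError(f"no step implementation matches: {text}")
    return ctx

scenario = [
    'Given an empty to-do list',
    'When the user adds "buy milk"',
    'Then the list contains 1 item',
]
ctx = run_scenario(scenario)
```

Because each scenario is plain text paired with executable checks, passing or failing a scenario directly reflects whether the generated software satisfies the stated user requirement, which is the evaluation signal E2EDev automates.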