Recent advances in large language models (LLMs) have significantly enhanced their coding capabilities. However, existing benchmarks predominantly focus on simplified or isolated aspects of coding, such as single-file code generation or repository issue debugging, and thus fall short of measuring the full spectrum of challenges posed by real-world programming activities. In this case study, we evaluate the performance of LLMs across the entire software development lifecycle with DevEval, which encompasses the stages of software design, environment setup, implementation, acceptance testing, and unit testing. DevEval features four programming languages, multiple domains, high-quality data collection, and carefully designed and verified metrics for each task. Empirical studies show that current LLMs, including GPT-4, fail to solve the challenges presented in DevEval. Our findings offer actionable insights for the future development of LLMs toward real-world programming applications.