How to evaluate Large Language Models (LLMs) in code generation remains an open question. Many benchmarks have been proposed, but they are inconsistent with practical software projects, e.g., unrealistic program distributions, insufficient dependencies, and small-scale project contexts. Thus, the capabilities of LLMs in practical projects are still unclear. In this paper, we propose a new benchmark named DevEval, aligned with developers' experiences in practical projects. DevEval is collected through a rigorous pipeline, containing 2,690 samples from 119 practical projects and covering 10 domains. Compared to previous benchmarks, DevEval aligns with practical projects in multiple dimensions, e.g., real program distributions, sufficient dependencies, and sufficiently large project contexts. We assess five popular LLMs on DevEval (e.g., gpt-4, gpt-3.5-turbo, CodeLLaMa, and StarCoder) and reveal their actual abilities in code generation. For instance, the highest Pass@1 of gpt-3.5-turbo is only 42 in our experiments. We also discuss the challenges and future directions of code generation in practical projects. We open-source DevEval and hope it can facilitate the development of code generation in practical projects.
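The abstract reports results as Pass@1 without defining the metric. As a rough illustration, the sketch below computes the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for a task where n samples were generated and c of them passed all tests, and then averages it over tasks. Whether DevEval computes Pass@1 in exactly this way is an assumption; the function names `pass_at_k` and `benchmark_pass_at_k` and the `(n, c)` input layout are hypothetical.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of samples generated for the task
    c: number of those samples that pass all tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples fail.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks; each entry is a hypothetical (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With a single sample per task (n = 1, k = 1), the score reduces to the fraction of tasks whose one generated program passes its tests, which is often reported multiplied by 100.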