How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks align poorly with real-world code repositories and are insufficient to evaluate the coding abilities of LLMs. This paper proposes a new benchmark, EvoCodeBench, to address these problems. It offers three primary advances. (1) EvoCodeBench aligns with real-world repositories in multiple dimensions, e.g., code distributions and dependency distributions. (2) EvoCodeBench offers comprehensive annotations (e.g., requirements, reference code, and reference dependencies) and robust evaluation metrics (e.g., Pass@k and Recall@k). (3) EvoCodeBench is an evolving benchmark designed to avoid data leakage: we build an automatic pipeline that updates EvoCodeBench from the latest repositories. We release the first version, EvoCodeBench-2403, containing 275 samples from 25 real-world repositories. Based on EvoCodeBench, we propose repository-level code generation and evaluate 10 popular LLMs (e.g., gpt-4, gpt-3.5, DeepSeek Coder, StarCoder 2, CodeLLaMa, Gemma, and Qwen 1.5). Our experiments reveal the coding abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 is only 20.73% in our experiments. We also analyze failed cases and summarize the shortcomings of existing LLMs on EvoCodeBench. We release EvoCodeBench, all prompts, and the LLMs' completions for further community analysis.
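The abstract names Pass@k and Recall@k without defining them. As a rough illustration only, the sketch below uses the widely used unbiased Pass@k estimator (n samples per task, c of which pass the tests) and one plausible reading of Recall@k as the best recall of the reference dependencies among k generated programs. Both definitions, and the function and parameter names, are assumptions for illustration rather than EvoCodeBench's confirmed implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimate for one task.

    n: number of programs sampled for the task
    c: number of those programs that pass all tests
    k: budget of programs considered (k <= n)
    Assumption: this is the standard estimator 1 - C(n-c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k programs fail,
    # so any draw of k programs must contain a passing one.
    return 1.0 - comb(n - c, k) / comb(n, k)


def recall_at_k(reference_deps: set[str], generated_deps: list[set[str]]) -> float:
    """Best recall of the reference dependencies among k generated programs.

    reference_deps: dependencies invoked by the reference code
    generated_deps: one set of invoked dependencies per generated program
    Assumption: Recall@k takes the maximum recall over the k programs.
    """
    if not reference_deps:
        # Hypothetical convention: nothing to recall counts as full recall.
        return 1.0
    return max(
        (len(reference_deps & deps) / len(reference_deps) for deps in generated_deps),
        default=0.0,
    )
```

A benchmark-level score would then be the mean of these per-task values over all samples; the dependency names passed to `recall_at_k` would come from whatever static analysis extracts calls from the generated code, which is not shown here.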