How to evaluate Large Language Models (LLMs) in code generation remains an open question. Existing benchmarks suffer from two limitations: data leakage and a lack of domain-specific evaluation. The former hurts the fairness of benchmarks, and the latter hinders practitioners from selecting superior LLMs for specific programming domains. To address these two limitations, we propose a new benchmark, EvoCodeBench, which has the following advances: (1) Evolving data. EvoCodeBench will be dynamically updated periodically (e.g., every 6 months) to avoid data leakage. This paper releases its first version, EvoCodeBench-2403, containing 275 samples from 25 repositories. (2) A domain taxonomy and domain labels. Based on the statistics of open-source communities, we design a programming domain taxonomy consisting of 10 popular domains. Based on this taxonomy, we annotate each sample in EvoCodeBench with a domain label. (3) Domain-specific evaluations. Besides Pass@k, we compute the Domain-Specific Improvement (DSI) and define LLMs' comfort and strange domains. These evaluations help practitioners select superior LLMs in specific domains and discover the shortcomings of existing LLMs. We evaluate 8 popular LLMs (e.g., gpt-4, DeepSeek Coder) on EvoCodeBench and summarize some insights. EvoCodeBench reveals the actual abilities of these LLMs in real-world repositories. For example, the highest Pass@1 of gpt-4 on EvoCodeBench-2403 is only 20.74%. Besides, we evaluate LLMs in different domains and discover their comfort and strange domains. For example, gpt-4 performs best in most domains but falls behind others in the Internet domain. StarCoder 2-15B unexpectedly performs well in the Database domain and even outperforms 33B LLMs. EvoCodeBench has been released.
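The Pass@k metric mentioned above is conventionally computed with the unbiased estimator introduced by Chen et al. (2021) for HumanEval; the abstract does not spell out EvoCodeBench's exact formula, so the sketch below shows only that standard estimator, where `n` generations are sampled per problem and `c` of them pass all tests.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n generations passes,
    given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 generations, 1 correct -> Pass@1 is 0.5
print(pass_at_k(2, 1, 1))
```

Averaging this per-problem estimate over the benchmark yields the reported Pass@k score.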