Code benchmarks such as HumanEval are widely adopted to evaluate the capabilities of Large Language Models (LLMs), providing insights into their strengths and weaknesses. However, current benchmarks primarily exercise LLMs' capabilities on common coding tasks (e.g., bubble sort, greatest common divisor), leaving domain-specific coding tasks (e.g., computation, system, cryptography) largely unexplored. To fill this gap, we propose a multi-domain code benchmark, DOMAINEVAL, designed to evaluate LLMs' coding capabilities thoroughly. Our pipeline works in a fully automated manner, enabling push-button construction from code repositories into formatted subjects under study. Evaluating 12 representative LLMs against DOMAINEVAL yields interesting findings. We notice that LLMs are generally good at computation tasks while falling short on cryptography and system coding tasks. The performance gap can be as large as 68.94% (80.94% - 12.0%) for some LLMs. We also observe that generating more samples can increase the overall performance of LLMs, while the domain bias may even grow. The contributions of this study include: DOMAINEVAL, a code generation benchmark dataset encompassing six popular domains; a fully automated pipeline for constructing code benchmarks; and an identification of the limitations of LLMs in code generation based on their performance on DOMAINEVAL, providing directions for future research. The leaderboard is available at https://domaineval.github.io/.