Program synthesis has long been studied, and recent approaches focus on directly using Large Language Models (LLMs) to generate code from user intent written in natural language. Code evaluation datasets, which contain curated synthesis problems with input/output test cases, are used to measure the performance of LLMs on code synthesis. However, the test cases in these datasets can be too limited in both quantity and quality to fully assess the functional correctness of the generated code. This limitation of existing benchmarks raises the following question: in the era of LLMs, is the generated code really correct? To answer it, we propose EvalPlus, a code synthesis benchmarking framework that rigorously evaluates the functional correctness of LLM-synthesized code. In short, EvalPlus takes a base evaluation dataset and adds an automatic input generation step that uses both LLM-based and mutation-based input generators to produce and diversify large numbers of new test inputs, which are then used to further validate the synthesized code. We extend the popular HUMANEVAL benchmark and build HUMANEVAL+ with 81x additional generated tests. Our extensive evaluation across 14 popular LLMs demonstrates that HUMANEVAL+ catches a significant amount of previously undetected wrong code synthesized by LLMs, reducing pass@k by 15.1% on average. Moreover, we even found several incorrect ground-truth implementations in HUMANEVAL. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction for improving programming benchmarks through automated test input generation.
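To make the mutation-based half of this idea concrete, the following is a minimal sketch, not the EvalPlus implementation. It assumes a deliberately simple, deterministic set of type-aware mutation rules and checks a candidate solution against a trusted reference on the enlarged input set (differential testing). All names here (`mutants`, `generate_inputs`, `find_counterexample`, `reference_max`, `buggy_max`) are hypothetical and chosen only for illustration; the LLM-based generator is not modeled.

```python
import copy


def mutants(value):
    """Yield one-step, type-aware mutations of a value (simplified, assumed rules)."""
    if isinstance(value, bool):          # check bool before int: bool is a subclass of int
        yield not value
    elif isinstance(value, int):
        yield value + 1
        yield value - 1
    elif isinstance(value, str):
        yield value + "a"                # grow the string
        if value:
            yield value[1:]              # shrink the string
    elif isinstance(value, list):
        if value:
            yield value[:-1]             # drop the last element
        for i, item in enumerate(value):
            for m in mutants(item):      # mutate one element in place
                yield value[:i] + [m] + value[i + 1:]


def generate_inputs(seeds, rounds=2):
    """Expand seed argument tuples by repeatedly mutating one argument at a time."""
    pool = [tuple(s) for s in seeds]
    seen = {repr(s) for s in pool}
    frontier = list(pool)
    for _ in range(rounds):
        next_frontier = []
        for args in frontier:
            for i, arg in enumerate(args):
                for m in mutants(arg):
                    new_args = args[:i] + (m,) + args[i + 1:]
                    if repr(new_args) not in seen:
                        seen.add(repr(new_args))
                        pool.append(new_args)
                        next_frontier.append(new_args)
        frontier = next_frontier
    return pool


def run(func, args):
    """Call func on a copy of args; report the result or the exception type."""
    try:
        return ("ok", func(*copy.deepcopy(args)))
    except Exception as exc:
        return ("error", type(exc).__name__)


def find_counterexample(candidate, reference, inputs):
    """Return the first input on which candidate and reference disagree, else None."""
    for args in inputs:
        if run(candidate, args) != run(reference, args):
            return args
    return None


def reference_max(xs):
    """Trusted reference: largest element, or None for an empty list."""
    return max(xs) if xs else None


def buggy_max(xs):
    """Plausible but wrong candidate: silently assumes a non-empty, non-negative list."""
    best = 0
    for x in xs:
        if x > best:
            best = x
    return best


base_tests = [([1, 2, 3],), ([5, 1],)]
print(find_counterexample(buggy_max, reference_max, base_tests))        # None: base tests pass
extended = generate_inputs(base_tests, rounds=2)
print(len(base_tests), "->", len(extended), "inputs")
print(find_counterexample(buggy_max, reference_max, extended))          # ([],)
```

In this toy setting the two base tests accept the buggy candidate, while the mutated inputs include the empty list, on which the candidate and the reference disagree. A real generator would also need to respect each problem's input preconditions, which this sketch ignores.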
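The abstract reports results in terms of pass@k without defining it. As a hedged illustration, the sketch below uses the unbiased estimator commonly applied with HUMANEVAL: for a problem with n sampled solutions of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The sample counts are made up for illustration and are not results from this work; the function name `pass_at_k` is likewise an assumption.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate for one problem: n samples, c of them correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 4 pass the base tests,
# but only 2 still pass once extra generated tests are added.
n = 10
for k in (1, 5):
    base = pass_at_k(n, 4, k)
    plus = pass_at_k(n, 2, k)
    print(f"pass@{k}: base={base:.3f} extended={plus:.3f}")
```

Because stricter tests can only lower c for a fixed set of samples, the estimate on the extended test suite is never higher than on the base suite; the benchmark-level score is the average of this quantity over all problems.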