Program synthesis has long been studied, and recent approaches focus on using Large Language Models (LLMs) to generate code directly. Programming benchmarks, which pair curated synthesis problems with test cases, are used to measure how well various LLMs perform on code synthesis. However, these test cases can be too few and too weak to fully assess the functional correctness of the generated code. This limitation of existing benchmarks raises the following question: in the era of LLMs, is the generated code really correct? To answer it, we propose EvalPlus, a code synthesis benchmarking framework that rigorously evaluates the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with a large number of new test cases produced by an automatic test input generator that combines LLM-based and mutation-based strategies. While EvalPlus is general, we extend the test cases of the popular HUMANEVAL benchmark by 81x to build HUMANEVAL+. Our extensive evaluation across 19 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HUMANEVAL+ catches a significant amount of previously undetected wrong code synthesized by LLMs, reducing pass@k by 13.6-15.3% on average. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs on code synthesis, but also opens up a new direction for improving such programming benchmarks through automated testing.
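To make the idea of mutation-based test augmentation concrete, the following is a minimal Python sketch, not the EvalPlus implementation. It assumes a trusted reference solution is available as an oracle, and it swaps in exhaustive single-step mutation for whatever selection strategy the actual generator uses. The names `mutants`, `augment`, and `failing_inputs`, the mutation rules, and the toy task (return the sorted distinct elements of a list) are all hypothetical and chosen for illustration only.

```python
from typing import Any, Callable, Iterator, List


def mutants(value: Any) -> Iterator[Any]:
    """Yield single-step, type-preserving mutants (illustrative rules only)."""
    if isinstance(value, bool):
        yield not value
    elif isinstance(value, int):
        yield value - 1
        yield value + 1
    elif isinstance(value, list):
        for i in range(len(value)):
            yield value[:i] + value[i + 1:]        # drop element i
            yield value[:i + 1] + value[i:]        # duplicate element i
            for m in mutants(value[i]):            # mutate element i
                yield value[:i] + [m] + value[i + 1:]


def augment(seeds: List[Any], rounds: int = 1) -> List[Any]:
    """Grow the seed inputs by applying every mutation for `rounds` rounds."""
    pool = list(seeds)
    frontier = list(seeds)
    for _ in range(rounds):
        next_frontier = []
        for value in frontier:
            for m in mutants(value):
                if m not in pool:                  # keep each input once
                    pool.append(m)
                    next_frontier.append(m)
        frontier = next_frontier
    return pool


def failing_inputs(
    candidate: Callable[[Any], Any],
    reference: Callable[[Any], Any],
    inputs: List[Any],
) -> List[Any]:
    """Differential testing: inputs where the candidate disagrees with the oracle."""
    failures = []
    for x in inputs:
        expected = reference(x)
        try:
            ok = candidate(x) == expected
        except Exception:
            ok = False                             # a crash counts as a failure
        if not ok:
            failures.append(x)
    return failures


# Hypothetical task: return the sorted distinct elements of a list.
def reference(xs: List[int]) -> List[int]:
    return sorted(set(xs))


def candidate(xs: List[int]) -> List[int]:
    return sorted(xs)                              # bug: forgets to deduplicate


seeds = [[3, 1, 2]]                                # the original, weak test input
augmented = augment(seeds)
failures = failing_inputs(candidate, reference, augmented)
```

In this toy setting the buggy candidate passes the single seed input, yet mutated inputs such as `[3, 3, 1, 2]` expose the missing deduplication. This mirrors, at a very small scale, how additional generated test cases can lower a measured pass rate.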