Large Language Models Are State-of-the-Art Evaluators of Code Generation

Recent advances in natural language generation have facilitated the use of large language models to assess the quality of generated text. Although these models have shown promising results in tasks such as machine translation and summarization, their applicability to code generation tasks remains limited without human involvement. The complexity of the programming concepts involved makes it difficult to develop evaluation metrics that align with human judgment. Token-matching-based metrics, such as BLEU, have demonstrated weak correlations with the judgments of human practitioners in code generation tasks. Moreover, using human-written test suites to evaluate functional correctness can be challenging in low-resource domains. To overcome these obstacles, we propose a new evaluation framework based on GPT-3.5 (\texttt{GPT-3.5-turbo}) for assessing code generation. Our framework addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences, without the need for test oracles or references. We evaluate the efficacy of our framework on two different tasks and four programming languages, comparing its performance with the state-of-the-art CodeBERTScore metric, which relies on a pre-trained model. Our results demonstrate that our framework surpasses CodeBERTScore, delivering high levels of accuracy and consistency across various programming languages and tasks. We also make our evaluation framework and datasets publicly available at \url{https://github.com/terryyz/llm-code-eval}, encouraging further research in the evaluation of code generation.
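To make the notion of "correlation with human preferences" concrete, the following is a minimal sketch of how agreement between an automatic metric and human ratings might be measured with a rank correlation. It assumes Kendall's tau-a as the statistic, which the abstract does not specify; the function name `kendall_tau` and all scores below are hypothetical and are not results from this work.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall's tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant. This is an
    illustrative choice of statistic, not one confirmed by the abstract.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical data: human usefulness ratings for five generated snippets,
# and the scores two made-up metrics assign to the same snippets.
human_ratings = [4, 1, 3, 0, 2]
metric_a = [0.9, 0.2, 0.7, 0.1, 0.5]  # ranks snippets exactly as humans do
metric_b = [0.3, 0.4, 0.2, 0.5, 0.6]  # mostly disagrees with the human ranking

tau_a = kendall_tau(human_ratings, metric_a)
tau_b = kendall_tau(human_ratings, metric_b)
print(f"metric_a vs. human: {tau_a:.2f}")
print(f"metric_b vs. human: {tau_b:.2f}")
```

Under this reading, a metric that "aligns with human judgment" is one whose tau against the human ratings is close to 1, while a weakly correlated metric sits near 0 or below.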
