ICE-Score: Instructing Large Language Models to Evaluate Code

Recent advances in natural language generation have made it possible to use large language models (LLMs) to assess the quality of generated text. Although these models have shown promising results in tasks such as machine translation and summarization, their applicability to code intelligence tasks remains limited without human involvement. The complexity of the programming concepts involved makes it difficult to develop evaluation metrics that align with human judgment. Token-matching-based metrics, such as BLEU, have demonstrated weak correlations with the judgments of human practitioners in code intelligence tasks. Moreover, relying on human-written test suites to evaluate functional correctness can be challenging in low-resource domains. To overcome these obstacles, we propose \texttt{ICE-Score}, a new evaluation metric that instructs LLMs to assess code. Our metric addresses the limitations of existing approaches by achieving superior correlations with functional correctness and human preferences, without the need for test oracles or references. We evaluate the efficacy of our metric on two aspects (\textit{human preference} and \textit{execution success}) and four programming languages. Our results demonstrate that our metric surpasses state-of-the-art metrics for code generation, delivering high levels of accuracy and consistency across various programming languages and tasks. We also make our evaluation metric and datasets publicly available\footnote{\url{https://github.com/terryyz/ice-score}}, encouraging further research on evaluating code intelligence tasks.