A proper code evaluation metric (CEM) profoundly impacts the evolution of code generation, an important research field in NLP and software engineering. Prevailing match-based CEMs (e.g., BLEU, Accuracy, and CodeBLEU) suffer from two significant drawbacks. (1) They primarily measure surface differences between code snippets without considering functional equivalence, yet functional equivalence is pivotal in evaluating code generation because different programs can perform identical operations. (2) They are predominantly designed for the Ref-only input format, whereas code evaluation requires versatility in input formats: besides Ref-only, there are NL-only and Ref&NL formats, which existing match-based CEMs cannot effectively accommodate. In this paper, we propose CodeScore, a large language model (LLM)-based CEM that estimates the functional correctness of generated code on three input types. To acquire CodeScore, we present UniCE, a unified code generation learning framework, for LLMs to learn code execution (i.e., learning PassRatio and Executability of generated code) with unified input. Extensive experimental results on multiple code evaluation datasets demonstrate that CodeScore achieves an absolute improvement of up to 58.87% in correlation with functional correctness compared to other CEMs, achieves state-of-the-art performance, and effectively handles three input formats.
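The gap between surface similarity and functional correctness can be illustrated with a minimal sketch. It assumes, as the abstract does not spell out, that PassRatio is the fraction of test cases a generated program passes and that Executability is a binary flag for whether the program runs without error. The helper names (`execution_labels`, `token_overlap`), the toy programs, and the crude token-overlap score are hypothetical stand-ins for illustration only; they are not the paper's implementation, nor BLEU or CodeBLEU.

```python
from typing import List, Tuple

# One test case: (positional arguments, expected return value).
TestCase = Tuple[tuple, object]


def execution_labels(source: str, func_name: str, tests: List[TestCase]) -> Tuple[float, int]:
    """Return (pass_ratio, executability) for one candidate program.

    Assumed definitions: pass_ratio is the fraction of tests passed;
    executability is 1 only if the code loads and no test raises.
    """
    if not tests:
        raise ValueError("at least one test case is required")
    namespace: dict = {}
    try:
        exec(source, namespace)      # define the candidate function
        func = namespace[func_name]
    except Exception:
        return 0.0, 0                # the code does not even load
    passed, executable = 0, 1
    for args, expected in tests:
        try:
            if func(*args) == expected:
                passed += 1
        except Exception:
            executable = 0           # runtime error on this test
    return passed / len(tests), executable


def token_overlap(candidate: str, reference: str) -> float:
    """Crude surface similarity: Jaccard overlap of whitespace tokens."""
    a, b = set(candidate.split()), set(reference.split())
    return len(a & b) / len(a | b)


# Toy programs (hypothetical): a reference, a functionally equivalent
# rewrite, and a near-copy that is functionally wrong.
REFERENCE = "def total(xs):\n    return sum(xs)\n"
EQUIVALENT = (
    "def total(xs):\n"
    "    acc = 0\n"
    "    for x in xs:\n"
    "        acc += x\n"
    "    return acc\n"
)
WRONG = "def total(xs):\n    return max(xs)\n"

TESTS: List[TestCase] = [(([1, 2, 3],), 6), (([],), 0), (([5],), 5)]

if __name__ == "__main__":
    for name, code in [("equivalent", EQUIVALENT), ("wrong", WRONG)]:
        ratio, runs = execution_labels(code, "total", TESTS)
        print(name, token_overlap(code, REFERENCE), ratio, runs)
```

In this toy setting the wrong near-copy overlaps more with the reference than the equivalent rewrite does, while the execution-based labels rank the two the other way round. Note that `exec` runs arbitrary code, so anything beyond a toy example should execute candidates in a sandbox.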