Since the rise of neural models of code that can generate long expressions and statements rather than a single next token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matching as BLEU does, CodeBERTScore computes a soft similarity score between each token in the generated code and each token in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score from CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score . Our language-specific models have been downloaded more than 25,000 times from the Huggingface Hub.
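To make the idea of a soft similarity score concrete, the following is a minimal sketch of BERTScore-style greedy matching over token embeddings. It is an illustration under stated assumptions, not the released implementation: the function names `cosine` and `soft_match_scores` are hypothetical, the embeddings are hand-written toy vectors rather than encodings from a pretrained code model, and the encoding of surrounding programmatic context is not modeled here.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def soft_match_scores(
    candidate: Sequence[Vector], reference: Sequence[Vector]
) -> Tuple[float, float, float]:
    """Greedy soft matching in the style of BERTScore (hypothetical helper).

    Each candidate token is matched to its most similar reference token
    (precision), each reference token to its most similar candidate token
    (recall), and the two are combined into an F1 score.
    """
    if not candidate or not reference:
        raise ValueError("both token sequences must be non-empty")

    # Pairwise similarity matrix: rows are candidate tokens, columns reference tokens.
    sim = [[cosine(c, r) for r in reference] for c in candidate]

    precision = sum(max(row) for row in sim) / len(candidate)
    recall = sum(
        max(sim[i][j] for i in range(len(candidate)))
        for j in range(len(reference))
    ) / len(reference)

    total = precision + recall
    f1 = 0.0 if total <= 0.0 else 2.0 * precision * recall / total
    return precision, recall, f1


if __name__ == "__main__":
    # Toy 2-dimensional "embeddings" (assumed values, for illustration only).
    generated = [[1.0, 0.0], [0.6, 0.8]]
    reference = [[1.0, 0.0], [0.0, 1.0]]
    print(soft_match_scores(generated, reference))
```

Unlike exact token matching, a generated token that is close to a reference token in embedding space still earns partial credit here, which is the property the soft score relies on.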