CodeBERTScore: Evaluating Code Generation with Pretrained Models of Code

Since the rise of neural models of code that can generate long expressions and statements rather than a single next token, one of the major problems has been reliably evaluating their generated output. In this paper, we propose CodeBERTScore: an automatic evaluation metric for code generation, which builds on BERTScore (Zhang et al., 2020). Instead of measuring exact token matches as BLEU does, CodeBERTScore computes a soft similarity score between each token in the generated code and each token in the reference code, using the contextual encodings of large pretrained models. Further, instead of encoding only the generated tokens as in BERTScore, CodeBERTScore also encodes the programmatic context surrounding the generated code. We perform an extensive evaluation of CodeBERTScore across four programming languages. We find that CodeBERTScore achieves a higher correlation with human preference and with functional correctness than all existing metrics. That is, generated code that receives a higher score from CodeBERTScore is more likely to be preferred by humans, as well as to function correctly when executed. Finally, while CodeBERTScore can be used with a multilingual CodeBERT as its base model, we release five language-specific pretrained models to use with our publicly available code at https://github.com/neulab/code-bert-score. Our language-specific models have been downloaded more than 25,000 times from the Hugging Face Hub.
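
The soft token matching described in the abstract can be illustrated with a minimal sketch of BERTScore-style greedy matching. The sketch assumes each token has already been mapped to a vector. The small hand-written vectors below are hypothetical stand-ins for the contextual encodings a pretrained model would produce, and the function names `cosine` and `soft_match_scores` are illustrative, not part of the released code. Encoding the surrounding programmatic context, as CodeBERTScore does, is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def soft_match_scores(candidate_vecs, reference_vecs):
    """Return (precision, recall, f1) using greedy soft matching.

    Precision: each candidate token is matched to its most similar
    reference token. Recall: each reference token is matched to its
    most similar candidate token. Both are averaged over tokens.
    """
    sims = [[cosine(c, r) for r in reference_vecs] for c in candidate_vecs]
    precision = sum(max(row) for row in sims) / len(candidate_vecs)
    recall = sum(
        max(sims[i][j] for i in range(len(candidate_vecs)))
        for j in range(len(reference_vecs))
    ) / len(reference_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total > 0 else 0.0
    return precision, recall, f1


# Hypothetical toy vectors (assumption: real vectors come from a model).
reference = [[1.0, 0.0], [0.0, 1.0]]
candidate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

precision, recall, f1 = soft_match_scores(candidate, reference)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

In this toy case every reference token has an exact counterpart in the candidate, so recall is 1, while the extra candidate token only partially matches and lowers precision. An exact-match metric would give that extra token no credit at all, which is the contrast with BLEU that the abstract draws.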

