Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics can be classified into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately assesses the functionality of predicted code by executing test cases. However, calculating this metric incurs significant overhead, motivating the design of an automated evaluation metric that can assess the functionality of predicted code without test cases. Additionally, a good evaluation metric should be robust; that is, it should maintain its accuracy even when the predicted code undergoes minor changes. To address these challenges, we propose CodeScore-R, an automated and robust metric based on UniXcoder and contrastive learning, for evaluating the functionality of code synthesis. CodeScore-R employs sketch-based processing, syntactic-equivalent transformations, and mutation testing to effectively mitigate the interference that identifiers, syntactic structures, and operators introduce into evaluation results. Experimental results demonstrate that on code generation and migration tasks in Java and Python, CodeScore-R outperforms other evaluation metrics, aligns more closely with Pass@k, and exhibits stronger robustness.
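For reference, the Pass@k metric mentioned above is typically computed with the standard unbiased estimator popularized by the HumanEval benchmark, which estimates the probability that at least one of k samples, drawn from n generated candidates of which c pass all test cases, is correct. A minimal sketch (the function name and interface are illustrative, not from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e. one minus the probability that all k drawn samples fail."""
    if n - c < k:
        # Fewer than k failing candidates exist, so any k-sample draw
        # must include at least one passing candidate.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 of 4 candidates pass; drawing 2 of them at random
# succeeds unless both draws are failures (probability 1/6).
print(pass_at_k(4, 2, 2))  # → 0.8333...
```

The per-problem estimates are then averaged over the benchmark. This execution-based computation is exactly the overhead (generating n samples and running every test suite) that the proposed test-free metric avoids.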