Evaluation metrics are crucial in the field of code synthesis. Commonly used code evaluation metrics can be classified into three types: match-based, semantic-based, and execution-based. Among them, the execution-based Pass@k metric accurately assesses the functionality of predicted code by executing test cases. However, calculating this metric incurs significant overhead, motivating the design of an automated evaluation metric that can assess the functionality of predicted code without test cases. Additionally, a good evaluation metric should be robust; that is, it should maintain its accuracy even when the predicted code undergoes minor changes. To address these challenges, we propose CodeScore-R, an automated and robust metric based on UniXcoder and contrastive learning, for evaluating the functionality of code synthesis. CodeScore-R employs sketch-based processing, syntactic-equivalent transformations, and mutation testing to effectively mitigate the interference that identifiers, syntactic structures, and operators introduce into evaluation results. Experimental results demonstrate that on code generation and migration tasks in Java and Python, CodeScore-R outperforms other evaluation metrics, aligns more closely with Pass@k, and exhibits stronger robustness.
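For reference, the Pass@k metric mentioned above is typically computed with the standard unbiased estimator popularized by the HumanEval benchmark, which estimates the probability that at least one of k samples, drawn from n generated candidates of which c pass all test cases, is correct. A minimal sketch (the function name and interface are illustrative, not from this paper):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: 1 - C(n-c, k) / C(n, k),
    i.e. one minus the probability that all k drawn samples fail."""
    if n - c < k:
        # Fewer than k failing candidates exist, so any k-sample draw
        # must include at least one passing candidate.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 2 of 4 candidates pass; drawing 2 of them at random
# succeeds unless both draws are failures (probability 1/6).
print(pass_at_k(4, 2, 2))  # → 0.8333...
```

The per-problem estimates are then averaged over the benchmark. This execution-based computation is exactly the overhead (generating n samples and running every test suite) that the proposed test-free metric avoids.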