The task of automated code review has recently gained significant attention from the machine learning community. However, current review-comment evaluation metrics rely on comparison with a human-written reference for a given code change (also called a diff), even though code review, like generation and summarization, is a one-to-many problem with many "valid reviews" for a single diff. To address this, we develop CRScore, a reference-free metric that measures dimensions of review quality such as conciseness, comprehensiveness, and relevance. We design CRScore to evaluate reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore produces valid, fine-grained scores of review quality with the highest alignment to human judgment (0.54 Spearman correlation) and greater sensitivity than reference-based metrics. We also release a corpus of 2.6k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
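The alignment figure reported above is a Spearman rank correlation between a metric's scores and human judgments. As a minimal sketch of how such an alignment number is computed, the snippet below implements Spearman's rho from scratch (Pearson correlation over average ranks) on hypothetical per-review scores; the score values are illustrative and not from the paper.

```python
# Hypothetical example: correlating an automated metric's review-quality
# scores with human judgments via Spearman rank correlation. All score
# values below are made up for illustration.

def _ranks(values):
    """Average 1-based ranks, with ties sharing the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Find the run of tied values starting at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the two rank vectors."""
    rx, ry = _ranks(x), _ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Hypothetical per-review scores: human Likert ratings vs. metric output.
human = [4, 2, 5, 1, 3]
metric = [0.8, 0.3, 0.7, 0.2, 0.5]
print(round(spearman(human, metric), 2))  # -> 0.9
```

In practice one would use `scipy.stats.spearmanr` for this; the pure-Python version is shown only to make the rank-correlation computation explicit.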