The task of automated code review has recently attracted significant attention from the machine learning community. However, current metrics for evaluating review comments rely on comparison against a human-written reference for a given code change (also called a diff). Moreover, like generation and summarization, code review is a one-to-many problem: a single diff admits many "valid reviews." We therefore develop CRScore, a reference-free metric that measures dimensions of review quality such as conciseness, comprehensiveness, and relevance. CRScore evaluates reviews in a way that is grounded in claims and potential issues detected in the code by LLMs and static analyzers. We demonstrate that CRScore produces valid, fine-grained scores of review quality that have the highest alignment with human judgment among open-source metrics (0.54 Spearman correlation) and are more sensitive than reference-based metrics. We also release a corpus of 2.9k human-annotated review quality scores for machine-generated and GitHub review comments to support the development of automated metrics.
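To make the reported alignment figure concrete: Spearman correlation measures how well the ranking induced by a metric agrees with the ranking induced by human judgments. Below is a minimal sketch of the tie-free Spearman formula in plain Python; the metric and human scores are illustrative made-up values, not data from the paper (which reports 0.54 on its annotated corpus).

```python
def rank(xs):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0] * len(xs)
    for position, i in enumerate(order):
        ranks[i] = position + 1
    return ranks

def spearman(a, b):
    """Tie-free Spearman rank correlation: 1 - 6*sum(d^2) / (n*(n^2-1))."""
    n = len(a)
    ra, rb = rank(a), rank(b)
    d_squared = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical example: metric scores for four reviews vs. human quality ratings.
metric_scores = [0.2, 0.5, 0.9, 0.4]
human_scores = [2, 3, 5, 1]
rho = spearman(metric_scores, human_scores)  # 0.8: rankings mostly agree
```

A correlation of 1.0 would mean the metric ranks reviews exactly as humans do; values near 0 mean the rankings are unrelated.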