Automated Code Review (ACR) is crucial for software quality, yet existing benchmarks often fail to reflect real-world complexity, which hinders the evaluation of modern Large Language Models (LLMs). Current benchmarks frequently focus on fine-grained code units, lack complete project context, and rely on inadequate evaluation metrics. To address these limitations, we introduce SWRBench, a new benchmark of 1000 manually verified Pull Requests (PRs) from GitHub that supports PR-centric review with full project context. SWRBench employs an objective LLM-based evaluation method that aligns strongly with human judgment (~90% agreement) by verifying whether the issues in a structured ground truth are covered by the generated reviews. Our systematic evaluation of mainstream ACR tools and LLMs on SWRBench reveals that current systems underperform and that ACR tools are more adept at detecting functional errors. We then propose and validate a simple multi-review aggregation strategy that significantly boosts ACR performance, increasing F1 scores by up to 43.67%. Our contributions are the SWRBench benchmark, its objective evaluation method, a comprehensive study of current ACR capabilities, and an effective enhancement approach, which together offer valuable insights for advancing ACR research.
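To make the coverage-based scoring concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed once a judge has decided which ground-truth issues a generated review covers. The abstract does not give the exact metric definitions, so the `ReviewResult` fields, the `f1_scores` function, and the assumption that each covered ground-truth issue corresponds to exactly one reported issue are all hypothetical and are not taken from SWRBench itself.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass
class ReviewResult:
    """Per-PR counts (hypothetical structure, not the benchmark's own format)."""
    ground_truth_issues: int  # issues listed in the structured ground truth
    reported_issues: int      # issues raised by the generated review
    covered_issues: int       # ground-truth issues the judge marked as covered


def f1_scores(results: Iterable[ReviewResult]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over a set of PRs.

    Assumption: every covered ground-truth issue is matched by exactly one
    reported issue, so covered_issues also counts the correct reports.
    """
    results = list(results)
    total_gt = sum(r.ground_truth_issues for r in results)
    total_reported = sum(r.reported_issues for r in results)
    total_hit = sum(r.covered_issues for r in results)

    precision = total_hit / total_reported if total_reported else 0.0
    recall = total_hit / total_gt if total_gt else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    demo = [ReviewResult(4, 5, 2), ReviewResult(1, 5, 1)]
    print(f1_scores(demo))
```

Under this sketch, a review that reports many issues but covers few ground-truth ones is penalized on precision, while a terse review that misses ground-truth issues is penalized on recall; F1 balances the two.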