Benchmarking LLMs for Fine-Grained Code Review with Enriched Context in Practice

Code review is a cornerstone of software quality assurance, and recent advances in Large Language Models (LLMs) have shown promise in automating it. However, existing benchmarks for LLM-based code review face three major limitations. (1) Lack of semantic context: most benchmarks provide only code diffs, without textual information such as issue descriptions, which is crucial for understanding developer intent. (2) Data quality issues: without rigorous validation, many samples are noisy (e.g., reviews on outdated or irrelevant code), which reduces evaluation reliability. (3) Coarse granularity: most benchmarks operate at the file or commit level, overlooking the fine-grained, line-level reasoning essential for precise review. We introduce ContextCRBench, a high-quality, context-rich benchmark for fine-grained evaluation of LLMs in code review. Our construction pipeline comprises three stages: Raw Data Crawling, which collects 153.7K issues and pull requests from top-tier repositories; Comprehensive Context Extraction, which links issue-PR pairs for textual context and extracts the full surrounding function or class for code context; and Multi-stage Data Filtering, which combines rule-based and LLM-based validation to remove outdated, malformed, or low-value samples, resulting in 67,910 context-enriched entries. ContextCRBench supports three evaluation scenarios aligned with the review workflow: hunk-level quality assessment, line-level defect localization, and line-level comment generation. Evaluating eight leading LLMs (four closed-source and four open-source) reveals that textual context yields greater performance gains than code context alone, while current LLMs remain far from human-level review ability. Deployed at ByteDance, ContextCRBench drives a self-evolving code review system, improving performance by 61.98% and demonstrating its robustness and industrial utility. https://github.com/kinesiatricssxilm14/ContextCRBench.

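
To make the construction pipeline more concrete, the sketch below shows one way a context-enriched entry and the rule-based part of the filtering stage might look. It is an illustration under stated assumptions, not the authors' implementation: the field names (`diff_hunk`, `issue_text`, `enclosing_code`, `commented_line`, `comment`, `is_outdated`), the helper names, and the minimum comment length are all hypothetical, and the LLM-based validation stage is omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ReviewEntry:
    """Hypothetical shape of one context-enriched sample (all field names are assumptions)."""
    diff_hunk: str              # the changed hunk under review
    issue_text: str             # textual context taken from the linked issue
    enclosing_code: str         # code context: the surrounding function or class
    commented_line: int         # 1-based line within the hunk that the reviewer commented on
    comment: str                # the reviewer's comment
    is_outdated: bool = False   # True if the comment targets code that was later changed


def passes_rules(entry: ReviewEntry, min_comment_chars: int = 10) -> bool:
    """Rule-based checks only; the threshold is an illustrative assumption."""
    if entry.is_outdated:
        return False  # drop reviews on outdated code
    if not entry.diff_hunk.strip() or not entry.issue_text.strip():
        return False  # drop malformed samples with missing diff or textual context
    hunk_length = len(entry.diff_hunk.splitlines())
    if not 1 <= entry.commented_line <= hunk_length:
        return False  # drop samples whose comment does not point inside the hunk
    # drop low-value comments that are too short to carry review content
    return len(entry.comment.strip()) >= min_comment_chars


def rule_filter(entries: Iterable[ReviewEntry]) -> List[ReviewEntry]:
    """Keep only the entries that pass every rule-based check."""
    return [entry for entry in entries if passes_rules(entry)]
```

In this sketch, an entry whose comment is just "LGTM", or whose `commented_line` falls outside the hunk, would be discarded before any LLM-based validation runs. Keeping the cheap rules first is a common design choice for multi-stage filters, since it reduces the number of samples that need a model call.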