Pull request (PR) review is essential for ensuring software quality, yet automating this task remains challenging due to noisy supervision, limited contextual understanding, and inadequate evaluation metrics. We present Sphinx, a unified framework for LLM-based PR review that addresses these limitations through three key components: (1) a structured data generation pipeline that produces context-rich, semantically grounded review comments by comparing pseudo-modified code against the merged code; (2) a checklist-based evaluation benchmark that assesses review quality by structured coverage of actionable verification points, moving beyond surface-level metrics such as BLEU; and (3) Checklist Reward Policy Optimization (CRPO), a novel training paradigm that uses rule-based, interpretable rewards to align model behavior with real-world review practices. Extensive experiments show that models trained with Sphinx achieve state-of-the-art review completeness and precision, outperforming both proprietary and open-source baselines by up to 40\% in checklist coverage. Together, these components yield PR review models that are not only fluent but also context-aware, technically precise, and readily deployable in real-world development workflows. The data will be released after review.
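To make the checklist-reward idea concrete, the sketch below shows one plausible form a rule-based, interpretable reward could take: the fraction of a PR's checklist items that a generated review covers. This is not the paper's implementation; `ChecklistItem`, `covers`, and `checklist_reward` are hypothetical names, and the keyword-overlap matcher is a deliberately naive stand-in for whatever rule set CRPO actually uses.

```python
# Hypothetical sketch of a checklist-coverage reward in the spirit of CRPO.
# Matching is naive keyword overlap, purely for illustration.
from dataclasses import dataclass

@dataclass(frozen=True)
class ChecklistItem:
    description: str          # the verification point a good review should cover
    keywords: frozenset[str]  # rule-based cues used to detect coverage (assumed)

def covers(review: str, item: ChecklistItem) -> bool:
    """Rule-based check: does the review mention this verification point?"""
    tokens = set(review.lower().split())
    return bool(tokens & item.keywords)

def checklist_reward(review: str, checklist: list[ChecklistItem]) -> float:
    """Interpretable scalar reward: fraction of checklist items the review covers."""
    if not checklist:
        return 0.0
    hits = sum(covers(review, item) for item in checklist)
    return hits / len(checklist)

# Usage: score one model rollout against a PR's checklist.
checklist = [
    ChecklistItem("Null check on the new parameter", frozenset({"null", "none", "nil"})),
    ChecklistItem("Thread safety of the shared cache", frozenset({"lock", "race", "thread"})),
]
print(checklist_reward("Consider a lock around the cache; also handle None input.", checklist))
# -> 1.0 (both verification points covered)
```

Because the reward is a simple per-item coverage count rather than a learned judge, each reward value can be traced back to the specific checklist items that were hit or missed, which is presumably what makes the signal interpretable and auditable during policy optimization.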