Automating Low-Risk Code Review at Meta: RADAR, Risk Calibration, and Review Efficiency

Chris Adams, Arjun Singh Banga, Parveen Bansal, Souvik Bhattacharya, Rujin Cao, Pedro Canahuati, Nate Cook, Brian Ellis, Prabhakar Goyal, Gurinder Grewal, Tianyu He, Matt Labunka, Alex Manners, David Molnar, Ging Cee Ng, Vishal Parekh, Jiefu Pei, Frederic Sagnes, James Saindon, Will Shackleton, Sid Sidhu, Gursharan Singh, Karthik Chengayan Sridhar, Matt Steiner, Pratibha Udmalpet, Sean Xia, Stacey Yan, Audris Mockus, Peter Rigby, Nachiappan Nagappan

AI-assisted coding tools have altered software production. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type, applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is 1/3 that of non-RADAR diffs, and the Production Incident rate is 1/50 that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. Risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
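To make the funnel and the threshold trade-off concrete, the following is a minimal sketch, not RADAR's actual implementation. It assumes the stages run in the order the abstract lists them, that a lower Diff Risk Score means lower risk, and that a diff passes the risk stage when its score is at or below a chosen percentile of historical scores. The `Diff` fields and the functions `risk_cutoff`, `review`, and `approve_rate` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from statistics import quantiles
from typing import List, Tuple


@dataclass
class Diff:
    """Hypothetical stand-in for a code change; every field is illustrative."""
    diff_id: str
    eligible: bool           # outcome of the eligibility gates
    passes_heuristics: bool  # outcome of the static heuristics
    risk_score: float        # stand-in for the Diff Risk Score (assumed: higher = riskier)
    llm_approves: bool       # stand-in for the LLM-based Automated Code Review verdict
    validation_ok: bool      # outcome of deterministic validation


def risk_cutoff(historical_scores: List[float], percentile: int) -> float:
    """Return the score at the given percentile (1..99) of historical scores."""
    cuts = quantiles(historical_scores, n=100, method="inclusive")
    return cuts[percentile - 1]


def review(diff: Diff, cutoff: float) -> Tuple[bool, str]:
    """Run the diff through each stage in order; stop at the first failure.

    Returns (landed, stage), where stage names the rejecting stage,
    or "landed" if every stage passed.
    """
    stages = [
        ("eligibility", diff.eligible),
        ("static_heuristics", diff.passes_heuristics),
        ("risk_score", diff.risk_score <= cutoff),  # assumed pass rule
        ("llm_review", diff.llm_approves),
        ("validation", diff.validation_ok),
    ]
    for name, passed in stages:
        if not passed:
            return False, name
    return True, "landed"


def approve_rate(diffs: List[Diff], cutoff: float) -> float:
    """Fraction of diffs that clear every stage at the given cutoff."""
    approved = sum(1 for d in diffs if review(d, cutoff)[0])
    return approved / len(diffs)


if __name__ == "__main__":
    history = list(range(101))  # synthetic scores, not real data
    sample = [Diff(f"D{i}", True, True, float(i), True, True)
              for i in range(0, 100, 10)]
    for pct in (25, 50):
        cutoff = risk_cutoff(history, pct)
        print(pct, cutoff, approve_rate(sample, cutoff))
```

In this toy setup, moving the cutoff from the 25th to the 50th percentile can only admit more diffs, which mirrors the direction of the approve-rate change reported above. The synthetic scores say nothing about the safety side of the trade-off, which the paper measures through revert and incident rates.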

