AI-assisted coding tools have changed how software is produced. At Meta, significant lines of code per human-landed diff grew by 105.9% year over year and per-developer diff volume rose 51%, with agentic AI responsible for over 80% of that growth. Meanwhile, the share of diffs receiving timely review has declined, exposing a widening gap between code supply and reviewer bandwidth. We ask three questions that progress from feasibility through calibration to impact: (1) can risk-stratified automation operate at scale across diverse organizations, (2) how does tuning the risk threshold affect the trade-off between automation yield and safety, and (3) to what extent does automated review reduce end-to-end latency for AI-generated changes? We deployed RADAR (Risk Aware Diff Auto Review), a multi-stage funnel that classifies each diff by authorship and source type and then applies eligibility gates, static heuristics, a machine-learned Diff Risk Score, LLM-based Automated Code Review, and deterministic validation before landing qualifying changes. We evaluate RADAR through telemetry covering 535K+ RADAR-reviewed diffs, observational before-after comparisons for policy changes, and difference-in-differences analysis of efficiency outcomes. RADAR has reviewed 535K+ diffs and landed 331K+. Relaxing the Diff Risk Score threshold from the 25th to the 50th percentile increased the approve rate to 60.31%. The revert rate for RADAR-reviewed diffs is one-third that of non-RADAR diffs, and the Production Incident rate is one-fiftieth that of non-RADAR diffs. RADAR reduces median time to close by over 330% and median diff review wall time by 35%. These results indicate that risk-aware layered automation can materially reduce review bottlenecks created by AI-driven code growth without compromising production safety.
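To make the funnel structure concrete, the following is a minimal sketch of a layered, short-circuiting review pipeline with a percentile-based risk threshold. It is an illustration under stated assumptions, not RADAR's implementation: the `Diff` fields, the stage names, the nearest-rank percentile rule, and the convention that a lower score means lower risk are all hypothetical, and the classification by authorship and source type is omitted for brevity.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Diff:
    """Hypothetical diff record; every field name here is an assumption."""
    diff_id: str
    eligible: bool           # outcome of the eligibility gates
    passes_heuristics: bool  # outcome of the static heuristics
    risk_score: float        # stand-in for the Diff Risk Score (lower = safer, assumed)
    llm_review_ok: bool      # stand-in for the LLM-based review verdict
    validation_ok: bool      # stand-in for deterministic validation


def percentile_threshold(scores: List[float], pct: float) -> float:
    """Return the nearest-rank percentile of historical risk scores."""
    ordered = sorted(scores)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def run_funnel(diff: Diff, risk_threshold: float) -> Tuple[bool, Optional[str]]:
    """Apply the stages in order; return (approved, name of first failing stage)."""
    stages = [
        ("eligibility", lambda d: d.eligible),
        ("static_heuristics", lambda d: d.passes_heuristics),
        ("risk_score", lambda d: d.risk_score <= risk_threshold),
        ("llm_review", lambda d: d.llm_review_ok),
        ("validation", lambda d: d.validation_ok),
    ]
    for name, check in stages:
        if not check(diff):
            return False, name  # short-circuit: later stages never run
    return True, None


# Illustrative scores only, not data from the paper.
historical = [0.1, 0.2, 0.3, 0.4]
t25 = percentile_threshold(historical, 25)
t50 = percentile_threshold(historical, 50)
candidate = Diff("D1", True, True, 0.15, True, True)
print(run_funnel(candidate, t25))
print(run_funnel(candidate, t50))
```

In this toy setup, the same diff is stopped at the risk stage under the 25th-percentile threshold and passes under the 50th-percentile threshold, which is the kind of yield change the threshold relaxation is meant to produce. Recording the first failing stage also gives per-stage telemetry of where diffs drop out of the funnel.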