Early-Stage Prediction of Review Effort in AI-Generated Pull Requests

As autonomous AI agents transition from code completion tools to full-fledged teammates capable of opening pull requests (PRs) at scale, software maintainers face a new challenge: not just reviewing code, but managing complex interaction loops with non-human contributors. This shift raises a critical question: can we predict which agent-generated PRs will consume excessive review effort before any human interaction begins? Analyzing 33,707 agent-authored PRs from the AIDev dataset across 2,807 repositories, we uncover a striking two-regime behavioral pattern that distinguishes autonomous agents from human developers. The first regime, representing 28.3 percent of all PRs, consists of instant merges (merged in less than 1 minute), reflecting success on narrow automation tasks. The second regime involves iterative review cycles in which agents frequently stall or abandon refinement (ghosting). We propose a Circuit Breaker triage model that predicts high-review-effort PRs (the top 20 percent by review effort) at creation time using only static structural features. A LightGBM model achieves an AUC of 0.957 on a temporal split, while semantic text features (TF-IDF, CodeBERT) provide negligible predictive value. At a 20 percent review budget, the model intercepts 69 percent of total review effort, enabling zero-latency governance. Our findings challenge prevailing assumptions in AI-assisted code review: review burden is dictated by what agents touch, not what they say, highlighting the need for structural governance mechanisms in human-AI collaboration.
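To make the "20 percent review budget" figure concrete, the sketch below shows one plausible way such a number could be computed: rank PRs by predicted risk, flag the top fraction allowed by the budget, and measure what share of total review effort those flagged PRs account for. This is an illustration under assumptions, not the authors' evaluation code. The function name `effort_captured_at_budget`, the idea of a single numeric effort value per PR, and the toy data are all hypothetical; the abstract does not specify how review effort is measured.

```python
def effort_captured_at_budget(scores, efforts, budget=0.20):
    """Share of total review effort covered by the top `budget` fraction of PRs.

    scores  -- predicted risk per PR (higher = more likely to be high-effort)
    efforts -- observed review effort per PR (the unit is an assumption,
               e.g. a count of review comments or review rounds)
    budget  -- fraction of PRs that reviewers can afford to flag
    """
    if len(scores) != len(efforts):
        raise ValueError("scores and efforts must have the same length")
    total = sum(efforts)
    if not scores or total == 0:
        return 0.0

    # Number of PRs that fit inside the review budget (rounded down).
    n_flagged = int(len(scores) * budget)

    # Rank PRs from highest to lowest predicted risk.
    ranked = sorted(zip(scores, efforts), key=lambda pair: pair[0], reverse=True)

    # Effort accounted for by the flagged PRs, as a share of all effort.
    captured = sum(effort for _, effort in ranked[:n_flagged])
    return captured / total


# Toy data (hypothetical): 10 PRs, two of which dominate review effort.
toy_scores = [0.9, 0.1, 0.8, 0.2, 0.3, 0.4, 0.05, 0.6, 0.7, 0.15]
toy_efforts = [50, 1, 30, 2, 3, 4, 1, 5, 3, 1]
toy_share = effort_captured_at_budget(toy_scores, toy_efforts, budget=0.20)
```

In this toy example the two highest-scored PRs are flagged and together carry most of the effort. In the paper's setting, the scores would come from the LightGBM model trained on static structural features.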

