Autonomous coding agents (e.g., OpenAI Codex, Devin, GitHub Copilot) are increasingly used to generate fix-related pull requests (PRs) in real-world software repositories. However, their practical effectiveness depends on whether these contributions are accepted and merged by project maintainers. In this paper, we present an empirical study of fix-related PRs involving AI agents, examining their integration outcomes, their latency, and the factors that hinder successful merging. We first analyze 8,106 fix-related PRs authored by five widely used AI coding agents from the AIDEV POP dataset to quantify the proportions of PRs that are merged, closed without merging, or still open. We then conduct a manual qualitative analysis of a statistically representative sample of 326 closed-but-unmerged PRs, spending approximately 100 person-hours to construct a structured catalog of 12 failure reasons. Our results indicate that test-case failures and prior resolution of the same issues by other PRs are the most common causes of non-integration, whereas build or deployment failures are comparatively rare. Overall, our findings expose key limitations of current AI coding agents in real-world settings and highlight directions for their further improvement and for more effective human-AI collaboration in software maintenance.