The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across task types and over time remain limited. This paper presents an empirical study of five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), based on 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin is the only agent with a consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas the other agents remain largely stable. Our analysis suggests that PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features, a 16-percentage-point gap that exceeds the typical inter-agent difference for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6% to 88.6%), and stratified chi-square tests confirm statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
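As a rough illustration of what a stratified chi-square comparison of acceptance rates involves, the sketch below runs a Pearson chi-square test on a 2x2 table (accepted vs. not accepted, for two agents) separately within each task type. It is a minimal sketch under stated assumptions, not the study's actual analysis code: the function name `chi_square_2x2`, the agent labels, and all counts are hypothetical placeholders and are not taken from the AIDev dataset.

```python
import math


def chi_square_2x2(accepted_a, total_a, accepted_b, total_b):
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are two agents; columns are accepted / not accepted PRs.
    Assumes every expected cell count is greater than zero.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    observed = [
        [accepted_a, total_a - accepted_a],
        [accepted_b, total_b - accepted_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed[i][j] - expected) ** 2 / expected

    # With 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2))
    return statistic, p_value


# Hypothetical (accepted, total) counts per task type; illustrative only.
hypothetical_counts = {
    "docs": {"agent_a": (90, 100), "agent_b": (75, 100)},
    "feature": {"agent_a": (70, 100), "agent_b": (68, 100)},
}

# Stratified analysis: one test per task type instead of one pooled test.
for task, agents in hypothetical_counts.items():
    stat, p = chi_square_2x2(*agents["agent_a"], *agents["agent_b"])
    print(f"{task}: chi2={stat:.3f}, p={p:.4f}")
```

Testing within each task type, rather than on pooled counts, keeps a difference in task mix between agents from being mistaken for a difference in acceptance rate. In practice, running one test per stratum would normally also call for a multiple-comparison correction.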