The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across task types and over time remain limited. This paper presents an empirical study of five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin is the only agent with a consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas the other agents remain largely stable. Our analysis suggests that PR task type is a dominant factor in acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features, a 16-percentage-point gap that exceeds the typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6% to 88.6%), and stratified chi-square tests confirm statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
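To make the stratified chi-square comparison mentioned above more concrete, the following is a minimal sketch of how one might test two agents' acceptance counts separately within each task type. It is not the study's actual analysis code: the function names, the agent pairing, and all counts are hypothetical placeholders, and the test shown is a plain Pearson chi-square on a 2x2 table without continuity correction.

```python
import math
from typing import Dict, Tuple

# (accepted, rejected) PR counts for one agent within one task type
Counts = Tuple[int, int]


def chi_square_2x2(a: Counts, b: Counts) -> Tuple[float, float]:
    """Pearson chi-square test (no continuity correction) on a 2x2 table.

    Rows are the two agents; columns are accepted / rejected PRs.
    Returns (statistic, p_value) with one degree of freedom.
    """
    a_acc, a_rej = a
    b_acc, b_rej = b
    n = a_acc + a_rej + b_acc + b_rej
    row_a, row_b = a_acc + a_rej, b_acc + b_rej
    col_acc, col_rej = a_acc + b_acc, a_rej + b_rej
    if min(row_a, row_b, col_acc, col_rej) == 0:
        raise ValueError("every row and column needs at least one PR")
    stat = n * (a_acc * b_rej - a_rej * b_acc) ** 2 / (
        row_a * row_b * col_acc * col_rej
    )
    # For 1 degree of freedom, the chi-square survival function
    # reduces to erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


def stratified_tests(
    strata: Dict[str, Tuple[Counts, Counts]]
) -> Dict[str, Tuple[float, float]]:
    """Run one independent 2x2 test per task type (stratum)."""
    return {task: chi_square_2x2(a, b) for task, (a, b) in strata.items()}


# Hypothetical counts for illustration only; NOT taken from the AIDev dataset.
example_strata = {
    "docs": ((90, 10), (70, 30)),
    "fix": ((50, 50), (50, 50)),
}

if __name__ == "__main__":
    for task, (stat, p) in stratified_tests(example_strata).items():
        print(f"{task}: chi2={stat:.2f}, p={p:.4g}")
```

Testing within each task type, rather than on pooled counts, keeps differences in task mix between agents from being mistaken for differences in agent quality. With nine task categories and several agent pairs, a multiple-comparison correction would likely also be needed; that step is omitted here.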