Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

Automated program repair has traditionally focused on single-hunk defects, overlooking the multi-hunk bugs that are prevalent in real-world systems. Repairing these bugs requires coordinated edits across multiple, disjoint code regions, which poses substantially greater challenges. We present the first systematic study of LLM-driven coding agents (Claude Code, Codex, Gemini-cli, and Qwen Code) on this task. We evaluate these four state-of-the-art agents on 404 multi-hunk bugs from the PolyHunk dataset, yielding 1,616 repair trajectories for large-scale behavioral analysis. We use fine-grained metrics to assess localization, repair accuracy, regression behavior, and operational dynamics across agents. Localization capability varies substantially, with Codex achieving the highest success rate (75.3%) and Qwen Code the lowest (40.4%). Repair accuracy also differs widely, ranging from 26.98% (Qwen Code) to 92.82% (Claude Code), and declines consistently as bug dispersion and complexity (measured by hunk divergence and spatial proximity) increase. High-performing agents (Claude Code and Codex) demonstrate superior semantic consistency, achieving positive average regression reduction, whereas lower-performing agents often introduce new test failures. Notably, agents do not fail fast: failed repairs consume 33%-440% more input tokens and take 35%-330% longer to execute than successful ones. We also develop Maple, which provides agents with repository-level context. Empirical results show that Maple improves the repair accuracy of Gemini-cli by ~21% through enhanced localization. By combining fine-grained metrics with trajectory-level analysis, this study moves beyond accuracy to explain how coding agents localize, reason, and act during multi-hunk repair. Our findings underscore the impact of bug divergence and spatial proximity on multi-hunk repair success for coding agents.
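
As a rough illustration of how the "do not fail fast" figures can be read, the sketch below computes the percentage by which failed trajectories exceed successful ones on a resource metric, per agent. It is not the study's analysis code: the record fields (`agent`, `success`, `input_tokens`), the agent name `agent_a`, and the sample numbers are hypothetical and are not taken from the PolyHunk results.

```python
from statistics import mean


def relative_overhead(trajectories, metric="input_tokens"):
    """Return, per agent, the percent by which the mean of `metric` over
    failed trajectories exceeds the mean over successful trajectories."""
    grouped = {}
    for t in trajectories:
        # Bucket each trajectory's metric value by agent and by outcome.
        buckets = grouped.setdefault(t["agent"], {True: [], False: []})
        buckets[t["success"]].append(t[metric])

    overhead = {}
    for agent, buckets in grouped.items():
        # An agent needs at least one success and one failure to compare.
        if not buckets[True] or not buckets[False]:
            continue
        ok_mean = mean(buckets[True])
        failed_mean = mean(buckets[False])
        overhead[agent] = 100.0 * (failed_mean - ok_mean) / ok_mean
    return overhead


# Hypothetical trajectories for a single made-up agent.
sample = [
    {"agent": "agent_a", "success": True, "input_tokens": 100_000},
    {"agent": "agent_a", "success": True, "input_tokens": 140_000},
    {"agent": "agent_a", "success": False, "input_tokens": 180_000},
]
overhead = relative_overhead(sample)
print(overhead)
```

Passing a different `metric` key (for example a hypothetical `wall_time_s` field) would give the analogous execution-time overhead.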
