As autonomous coding agents see rapid adoption, their evaluation has focused primarily on task completion rates while holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or ``cleanliness'', of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour but differ in static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean up a messy one. We author 33 tasks across six such pairs and evaluate them through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and revisit files 34% less often. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviour.
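To make the minimal-pair comparison concrete, the following is a minimal sketch of how per-trial records might be aggregated into the two quantities the abstract reports: pass rate per variant and the relative token saving on the clean variant. The record layout, the names `TRIALS` and `summarise`, and every number in the sample data are assumptions made for illustration; they are not the paper's data or analysis code. (As a side note, 660 trials over 33 tasks works out to 20 trials per task, which would be consistent with 10 runs per variant, although the abstract does not state the split.)

```python
from statistics import mean

# Hypothetical trial records: (pair_id, task_id, variant, passed, tokens).
# All values are invented for illustration and are NOT the paper's results.
TRIALS = [
    ("pair-a", "task-1", "clean", True, 900),
    ("pair-a", "task-1", "messy", True, 1000),
    ("pair-a", "task-2", "clean", False, 1800),
    ("pair-a", "task-2", "messy", False, 2000),
]


def summarise(trials):
    """Return (pass rate per variant, relative token saving of clean vs messy)."""
    by_variant = {"clean": [], "messy": []}
    for _pair, _task, variant, passed, tokens in trials:
        by_variant[variant].append((passed, tokens))

    # Fraction of trials whose hidden tests passed, per variant.
    pass_rate = {
        variant: mean(1.0 if passed else 0.0 for passed, _ in rows)
        for variant, rows in by_variant.items()
    }
    # Mean token usage per variant.
    mean_tokens = {
        variant: mean(tokens for _, tokens in rows)
        for variant, rows in by_variant.items()
    }
    # Relative saving: 0.10 means the clean variant used 10% fewer tokens.
    token_saving = 1.0 - mean_tokens["clean"] / mean_tokens["messy"]
    return pass_rate, token_saving


if __name__ == "__main__":
    rates, saving = summarise(TRIALS)
    print(rates, f"token saving on clean code: {saving:.1%}")
```

In this toy data the pass rate is identical across variants while token usage differs, which mirrors the shape of the reported finding without reproducing its numbers. A real analysis would presumably also account for per-task pairing and run-to-run variance.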