LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. We investigate whether agents can (i) execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. To this end, we construct CodeTaste, a benchmark mined from large multi-file refactorings in open-source projects. To score solutions, we combine repository test suites, which measure functional correctness, with tailored static checks, which use dataflow reasoning to verify that undesired code patterns are removed and desired ones are introduced. Our results show a clear gap: agents perform well at implementing refactorings that are specified in detail, but often fail to discover the human refactoring choices when given only a focus area for changes. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases. We release the benchmark, leaderboard, and code.
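The scoring idea and the proposal-selection step can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the CodeTaste code: the `CheckResult` fields, the rule that a solution counts as successful only if the tests pass and all static checks are satisfied, and the `alignment` scoring callable are assumptions made purely for illustration. The counts stand in for the outcome of the static checks, which the benchmark derives with dataflow reasoning.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CheckResult:
    """Hypothetical summary of one agent solution (not the benchmark's schema)."""
    tests_passed: bool        # outcome of the repository test suite
    undesired_remaining: int  # undesired code patterns still present
    desired_found: int        # desired code patterns that were introduced
    desired_total: int        # desired code patterns that were expected


def is_successful(result: CheckResult) -> bool:
    """Assumed rule: tests pass, no undesired pattern remains, all desired ones appear."""
    return (
        result.tests_passed
        and result.undesired_remaining == 0
        and result.desired_found == result.desired_total
    )


def select_best_proposal(
    proposals: Sequence[str],
    alignment: Callable[[str], float],
) -> str:
    """Pick the proposal with the highest alignment score before implementing it."""
    return max(proposals, key=alignment)
```

Under these assumptions, a solution that passes the tests but leaves one undesired pattern in place would not count as successful, which mirrors the abstract's separation of functional correctness from structural alignment.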