Code agents have advanced remarkably in recent years, exhibiting strong capabilities across a wide range of software engineering tasks. However, their misuse often produces bloated and disorganized code that impairs readability, extensibility, and robustness. Despite this risk, existing benchmarks largely evaluate functional correctness rather than the long-term maintainability of the code that agents produce. In this paper, we propose SmellBench, an extensible code refactoring benchmark that proactively injects code smells into clean code snippets from real-world repositories. This design enables the generation of controlled, high-quality, and diverse refactoring cases with human-written ground truth. Specifically, it contains 294 cases spanning 7 popular smell types, 3 difficulty levels, and 2 instruction settings across 7 real-world repositories. We further design 3 evaluation aspects covering functional correctness, localization ability, and refactoring quality. Experiments with 2 popular agents and 6 large language models (LLMs) show that the best combination (Qwen Code + Claude Sonnet 4.5) achieves a smell elimination score of only 50.34. Further analysis reveals that this gap arises from the agents' focus on local code smells and their lack of cross-file understanding, which hinders comprehensive smell elimination.