Maintaining and scaling software systems relies heavily on effective code refactoring, yet this process remains labor-intensive, requiring developers to carefully analyze existing codebases and avoid introducing new defects. Although recent advances have leveraged Large Language Models (LLMs) to automate refactoring tasks, current solutions are constrained in scope and lack mechanisms to guarantee code compilability and successful test execution. In this work, we introduce MANTRA, a comprehensive LLM agent-based framework that automates method-level refactoring. MANTRA integrates Context-Aware Retrieval-Augmented Generation, coordinated Multi-Agent Collaboration, and Verbal Reinforcement Learning to emulate human decision-making during refactoring while preserving code correctness and readability. Our empirical study, conducted on 703 instances of "pure refactorings" (i.e., code changes exclusively involving structural improvements) drawn from 10 representative Java projects, covers the six most prevalent refactoring operations. Experimental results demonstrate that MANTRA substantially surpasses a baseline LLM model (RawGPT), achieving an 82.8% success rate (582/703) in producing code that compiles and passes all tests, compared to just 8.7% (61/703) for RawGPT. Moreover, compared to IntelliJ's LLM-powered refactoring tool (EM-Assist), MANTRA exhibits a 50% improvement in generating Extract Method transformations. A usability study involving 37 professional developers further shows that refactorings performed by MANTRA are perceived to be as readable and reusable as human-written code, and in some cases even preferable. These results highlight the practical advantages of MANTRA and underscore the growing potential of LLM-based systems in advancing the automation of software refactoring tasks.
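To make the Extract Method operation mentioned above concrete, the sketch below shows the classic form of this refactoring in Java. This is a minimal illustrative example (the `Invoice` class, field names, and method names are hypothetical, not drawn from the paper's benchmark): a cohesive span of logic inside `printOwing` is moved into its own method, `printDetails`, without changing observable behavior.

```java
// Illustrative Extract Method refactoring (hypothetical example, not from
// the paper's dataset). Before refactoring, printOwing would build both the
// banner and the detail lines inline; after refactoring, the detail-building
// logic lives in its own cohesive, independently reusable method.
public class Invoice {
    private final String customer;
    private final double outstanding;

    public Invoice(String customer, double outstanding) {
        this.customer = customer;
        this.outstanding = outstanding;
    }

    // Refactored method: now delegates to the extracted helper.
    public String printOwing() {
        return "*** Customer Owes ***\n" + printDetails();
    }

    // Extracted method: the structural change is purely organizational,
    // so the output of printOwing is identical before and after.
    private String printDetails() {
        return "name: " + customer + "\namount: " + outstanding;
    }
}
```

A "pure refactoring" in the paper's sense is exactly this kind of change: the program's structure improves while its behavior, and therefore its test results, stay the same.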