Refactoring is a software engineering practice that aims to improve code quality without altering program behavior. Although automated refactoring tools have been extensively studied, their practical applicability remains limited. Recent advances in Large Language Models (LLMs) have introduced new opportunities for automated code refactoring. However, the effects of such LLM-driven approaches on code quality remain insufficiently evaluated. In this paper, we present a comprehensive empirical study of LLM-driven refactoring using GPT-4o, applied to 100 Python classes from the ClassEval benchmark. Unlike prior work, our study explores a wide range of class-level refactorings inspired by Fowler's catalog and evaluates their effects from three complementary perspectives: (i) behavioral correctness, verified through unit tests; (ii) code quality, assessed via Pylint, Flake8, and SonarCloud; and (iii) readability, measured using a state-of-the-art readability tool. Our findings show that GPT-4o generally produces behavior-preserving refactorings that reduce code smells and improve quality metrics, albeit at the cost of decreased readability. Our results provide new evidence on the capabilities and limitations of LLMs in automated software refactoring, highlighting directions for integrating LLMs into practical refactoring workflows.