Recent advances in large language models (LLMs) make it potentially feasible to automatically refactor source code with LLMs. However, it remains unclear how well LLMs perform, compared to human experts, in conducting refactorings automatically and accurately. To fill this gap, in this paper we conduct an empirical study to investigate the potential of LLMs in automated software refactoring, focusing on the identification of refactoring opportunities and the recommendation of refactoring solutions. We first construct a high-quality refactoring dataset comprising 180 real-world refactorings from 20 projects, and conduct the empirical study on this dataset. With the to-be-refactored Java documents as input, ChatGPT and Gemini identified only 28 and 7, respectively, of the 180 refactoring opportunities. However, explaining the expected refactoring subcategories and narrowing the search space in the prompts substantially increased ChatGPT's success rate from 15.6% to 86.7%. Concerning the recommendation of refactoring solutions, ChatGPT recommended 176 refactoring solutions for the 180 refactorings, and 63.6% of the recommended solutions were comparable to (or even better than) those constructed by human experts. However, 13 of the 176 solutions suggested by ChatGPT and 9 of the 137 solutions suggested by Gemini were unsafe in that they either changed the functionality of the source code or introduced syntax errors, which indicates the risks of LLM-based refactoring. To this end, we propose a detect-and-reapply tactic, called RefactoringMirror, to avoid such unsafe refactorings. By reapplying the identified refactorings to the original code using thoroughly tested refactoring engines, RefactoringMirror effectively mitigates the risks associated with LLM-based automated refactoring while still leveraging LLMs' intelligence to obtain valuable refactoring recommendations.