Large Language Models (LLMs) have shown potential to enhance software development through automated code generation and refactoring, reducing development time and improving code quality. This study empirically evaluates StarCoder2, an LLM optimized for code generation, on refactoring tasks across 30 open-source Java projects. We compare StarCoder2's performance against that of human developers, focusing on (1) code quality improvements, (2) the types and effectiveness of refactorings, and (3) performance gains from one-shot and chain-of-thought prompting. Our results indicate that StarCoder2 reduces code smells by 20.1% more than developers, excelling at systematic issues such as Long Statement and Magic Number, while developers handle complex, context-dependent issues better. One-shot prompting increases the unit test pass rate by 6.15% and improves code smell reduction by 3.52%. Generating five refactorings per input further increases the pass rate by 28.8%, suggesting that combining one-shot prompting with multiple candidate refactorings optimizes performance. These findings provide insights into StarCoder2's potential and into best practices for integrating LLMs into software refactoring, supporting more efficient and effective code improvement in real-world applications.