Popular IDEs frequently contain bugs in their refactoring implementations. Ensuring that a transformation preserves a program's behavior is a complex task. Traditional detection methods rely on predefined preconditions for each refactoring type, limiting their scalability and adaptability to new transformations. These methods often require extensive static and dynamic analyses, which are computationally expensive, time-consuming, and may still fail to detect certain refactoring bugs. This study evaluates the effectiveness of Small Language Models (SLMs) in detecting two types of refactoring bugs in Java and Python: (i) transformations that introduce errors or behavioral changes (Type I) and (ii) transformations unnecessarily blocked by IDEs despite being valid (Type II). We assess whether Llama 3.2 3B, Mistral 7B, Gemma 2 9B, DeepSeek-R1 14B, Phi-4 14B, o1-mini, and o3-mini-high can accurately detect 100 refactoring bugs reported in widely used Java and Python IDEs, such as Eclipse and NetBeans. The study covers 16 refactoring types and employs zero-shot prompting on consumer-grade hardware to evaluate the models' ability to reason about refactoring correctness without explicit prior training. The proprietary o3-mini-high model achieved the highest detection rate, identifying 84.3% of Type I bugs. The open-source Phi-4 14B performed comparably well, demonstrating strong effectiveness across both bug types. However, o3-mini-high struggled with Type II bugs, correctly identifying and applying valid but blocked transformations in only 40% of cases. The findings highlight the potential of SLMs for efficiently detecting refactoring bugs, particularly in verifying behavioral changes. Additionally, SLMs offer a more adaptable solution capable of generalizing across different refactoring types and programming languages, addressing key limitations of traditional approaches.