Popular IDEs frequently contain bugs in their refactoring implementations. Ensuring that a transformation preserves a program's behavior is a complex task. Traditional detection methods rely on predefined preconditions for each refactoring type, limiting their scalability and adaptability to new transformations. These methods often require extensive static and dynamic analyses, which are computationally expensive, time-consuming, and may still fail to detect certain refactoring bugs. This study evaluates the effectiveness of Small Language Models (SLMs) in detecting two types of refactoring bugs in Java and Python: (i) transformations that introduce errors or behavioral changes (Type I) and (ii) transformations unnecessarily blocked by IDEs despite being valid (Type II). We assess whether Llama 3.2 3B, Mistral 7B, Gemma 2 9B, DeepSeek-R1 14B, Phi-4 14B, o1-mini, and o3-mini-high can accurately detect 100 refactoring bugs reported in widely used Java and Python IDEs, such as Eclipse and NetBeans. The study covers 16 refactoring types and employs zero-shot prompting on consumer-grade hardware to evaluate the models' ability to reason about refactoring correctness without explicit prior training. The proprietary o3-mini-high model achieved the highest detection rate, identifying 84.3% of Type I bugs. The open-source Phi-4 14B performed comparably well, demonstrating strong effectiveness across both bug types. However, o3-mini-high struggled with Type II bugs, correctly identifying and applying valid but blocked transformations in only 40% of cases. The findings highlight the potential of SLMs for efficiently detecting refactoring bugs, particularly in verifying behavioral changes. Additionally, SLMs offer a more adaptable solution capable of generalizing across different refactoring types and programming languages, addressing key limitations of traditional approaches.