With the rapid adoption of large language models (LLMs) for automated code refactoring, assessing and ensuring functional equivalence between an LLM-generated refactoring and the original implementation becomes critical. Whereas prior work typically relies on predefined test cases to evaluate correctness, in this work we leverage differential fuzzing to check the functional equivalence of LLM-generated code refactorings. Unlike test-based evaluation, a differential fuzzing-based equivalence checker needs no predefined test cases and can explore a much larger input space by executing and comparing thousands of automatically generated test inputs. In a large-scale evaluation of six LLMs (CodeLlama, Codestral, StarChat2, Qwen-2.5, Olmo-3, and GPT-4o) across three datasets and two refactoring types, we find that LLMs show a non-trivial tendency to alter program semantics, producing 19–35% functionally non-equivalent refactorings. Our experiments further show that about 21% of these non-equivalent refactorings go undetected by the existing test suites of the three evaluated datasets. Collectively, these findings imply that relying on existing tests may overestimate the functional equivalence of LLM-generated code refactorings, which remain prone to semantic divergence.
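The core idea of differential fuzzing for equivalence checking can be illustrated with a minimal sketch. This is not the paper's implementation, only a simplified illustration with a hypothetical `differential_check` helper and a made-up `clamp` example: random inputs are generated, both versions are executed, and any output divergence serves as a counterexample to functional equivalence.

```python
import random

def differential_check(f_orig, f_refac, gen_input, trials=1000):
    """Run both implementations on randomly generated inputs.

    Returns the first input on which their outputs diverge, or None
    if all trials agree (which suggests, but does not prove, equivalence).
    """
    for _ in range(trials):
        x = gen_input()
        try:
            out_orig = f_orig(x)
        except Exception as e:
            out_orig = ("exception", type(e).__name__)
        try:
            out_refac = f_refac(x)
        except Exception as e:
            out_refac = ("exception", type(e).__name__)
        if out_orig != out_refac:
            return x  # counterexample: semantics diverged
    return None

# Hypothetical original function and a subtly non-equivalent "refactoring"
def clamp(x):
    return max(0, min(100, x))

def clamp_refac(x):
    return min(100, max(1, x))  # bug: lower bound silently changed to 1

random.seed(0)
cex = differential_check(clamp, clamp_refac, lambda: random.randint(-10, 110))
```

Here any input `x <= 0` exposes the divergence (the original returns 0 or a negative clamp to 0, the refactoring returns 1), even though a hand-written test suite that only exercises typical in-range values would pass both versions, which is precisely the gap the abstract attributes to test-based evaluation.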