Tangled code changes (commits that conflate unrelated modifications such as bug fixes, refactorings, and enhancements) introduce significant noise into bug datasets and degrade the performance of bug prediction models. Addressing this issue at a fine-grained, method-level granularity remains unexplored. Doing so is critical, as recent bug prediction models, driven by practitioner demand, increasingly target finer granularity rather than traditional class- or file-level predictions. This study investigates the utility of Large Language Models (LLMs) for detecting tangled code changes by leveraging both commit messages and method-level code diffs. We formulate the problem as a binary classification task and evaluate multiple prompting strategies, including zero-shot, few-shot, and chain-of-thought prompting, using state-of-the-art proprietary LLMs such as GPT-5 and Gemini-2.0-Flash, and open-source models such as GPT-OSS-120B and CodeBERT. Our results demonstrate that combining commit messages with code diffs significantly enhances model performance, with combined few-shot and chain-of-thought prompting achieving an F1-score of 0.883. Additionally, we explore machine learning models trained on LLM-generated embeddings, where a multi-layer perceptron classifier achieves superior performance (F1-score: 0.906, MCC: 0.807). Applying our approach to 49 open-source projects improves the distributional separability of code metrics between buggy and non-buggy methods, demonstrating the promise of LLMs for method-level commit untangling and potentially contributing to the accuracy of future bug prediction models.