Flaky tests yield inconsistent results when repeatedly executed on the same code revision. They interfere with the automated quality assurance of code changes and hinder efficient software testing. Previous work has evaluated approaches that train machine learning models to classify flaky tests based on identifiers in the test code. However, the resulting classifiers have been shown to lack generalizability, which limits their applicability in practical environments. Recently, pre-trained Large Language Models (LLMs) have demonstrated the ability to generalize across a variety of tasks and thus represent a promising way to address the generalizability problem of earlier approaches. In this study, we evaluated three LLMs (two general-purpose models and one code-specific model) with three prompting techniques on two benchmark datasets from prior studies on flaky test classification. Furthermore, we manually investigated 50 samples from these datasets to determine whether classifying flaky tests based solely on the test code is feasible for humans. Our findings indicate that LLMs struggle to classify flaky tests given only the test code: the results of our best prompt-model combination were only marginally better than random guessing. In our manual analysis, we found that the test code does not necessarily contain sufficient information for a flakiness classification. These findings motivate future work that evaluates LLMs for flakiness classification with additional context, for example via retrieval-augmented generation or agentic AI.
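To make the notion of flakiness concrete, the following minimal sketch (hypothetical, not taken from the studied datasets) shows a check whose verdict depends on simulated network jitter rather than on the code under test, so repeated runs on the same revision disagree:

```python
import random

def flaky_check(run_id: int) -> bool:
    """A hypothetical test body whose outcome depends on simulated
    network jitter -- the hallmark of a flaky test: same code
    revision, nondeterministic pass/fail verdicts."""
    rng = random.Random(run_id)  # each run sees different jitter
    latency_ms = rng.gauss(80, 30)  # simulated response latency
    return latency_ms < 80  # "assertion": passes only on low jitter

# Re-running the "same test" across 20 runs yields both verdicts,
# i.e., inconsistent results on an unchanged code revision.
verdicts = {flaky_check(run_id) for run_id in range(20)}
```

Note that nothing in the function body alone reveals that `flaky_check` is flaky without knowledge of the latency distribution it depends on, which mirrors the abstract's observation that test code by itself may not carry enough information for a flakiness classification.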