Flaky tests yield different results when executed multiple times for the same version of the source code. Thus, they provide an ambiguous signal about the quality of the code and interfere with the automated assessment of code changes. While a variety of factors can cause test flakiness, approaches to fix flaky tests are typically tailored to address specific causes. However, the prevalent root causes of flaky tests can vary depending on the programming language, application domain, or size of the software project. Since manually labeling flaky tests is time-consuming and tedious, this work proposes an LLMs-as-annotators approach that leverages intra- and inter-model consistency to label issue reports related to fixed flakiness issues with the relevant root cause category. This allows us to gain an overview of prevalent flakiness categories in the issue reports. We evaluated our labeling approach in the context of SAP HANA, a large industrial database management system. Our results suggest that SAP HANA's tests most commonly suffer from issues related to concurrency (23%, 130 of 559 analyzed issue reports). Moreover, our results suggest that different test types face different flakiness challenges. Therefore, we encourage future research on flakiness mitigation to consider evaluating the generalizability of proposed approaches across different test types.
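The intra- and inter-model consistency check described above can be sketched as a simple voting rule: a label is accepted only when each model is consistent across its own repeated runs and all models agree with each other; otherwise the issue report is escalated for manual review. This is a minimal illustrative sketch, not the paper's actual implementation; the function name, the threshold parameter, and the category names are assumptions.

```python
from collections import Counter

def consistent_label(runs_by_model, intra_threshold=1.0):
    """Return an agreed root-cause label, or None for manual review.

    runs_by_model: dict mapping a model name to the list of labels that
    model produced over repeated annotation runs of the same issue report.

    A label is accepted only if every model is internally consistent
    (its majority label covers at least intra_threshold of its runs)
    and all models agree on that same label (inter-model consistency).
    Names and the threshold are illustrative assumptions, not the
    paper's actual scheme.
    """
    majority_labels = []
    for model, labels in runs_by_model.items():
        label, count = Counter(labels).most_common(1)[0]
        if count / len(labels) < intra_threshold:
            return None  # model disagrees with itself -> escalate
        majority_labels.append(label)
    if len(set(majority_labels)) == 1:
        return majority_labels[0]  # all models agree
    return None  # models disagree with each other -> escalate
```

For example, if two models each label a report "concurrency" in all of their runs, the label is accepted; if they split between "concurrency" and "async wait", the report would instead be routed to a human annotator.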