Taming Timeout Flakiness: An Empirical Study of SAP HANA

Source: arXiv. 12 pages, 9 figures, 3 tables. Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE SEIP 2024)

Regression testing aims to prevent code changes from breaking existing features. Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes, thus providing an ambiguous signal. Test timeouts are one contributing factor to such flaky test failures.

With the goal of reducing test flakiness in SAP HANA, we empirically study the impact of test timeouts on flakiness in system tests. We evaluate different approaches to automatically adjust timeout values, assessing their suitability for reducing execution time costs and improving build turnaround times.

We collect metadata on SAP HANA's test executions by repeatedly executing tests on the same code revision over a period of six months. We analyze the test flakiness rate, investigate the evolution of test timeout values, and evaluate different approaches for optimizing timeout values.

The test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions. Test timeouts account for 70% of flaky test failures. Developers typically react to flaky timeouts by manually increasing timeout values or splitting long-running tests. However, manually adjusting timeout values is a tedious task. Our approach for timeout optimization reduces timeout-related flaky failures by 80% and reduces the overall median timeout value by 25%, i.e., blocked tests are identified faster.

Test timeouts are a major contributing factor to flakiness in system tests. It is challenging for developers to effectively mitigate this problem manually. Our technique for optimizing timeout values reduces flaky failures while minimizing test costs. Practitioners working on large-scale industrial software systems can use our findings to increase the effectiveness of their system tests while reducing the burden on developers to manually maintain appropriate timeout values.
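The two quantities the abstract revolves around, the flakiness rate and an automatically derived timeout, can be illustrated with a minimal Python sketch. The abstract does not describe the authors' optimization technique, so `suggest_timeout` below is a hypothetical heuristic (a high percentile of past passing-run durations multiplied by a safety factor) and should not be read as the method evaluated in the paper. The function names, parameters, and default values are all assumptions made for illustration. `flakiness_rate` uses the common definition of a flaky test: one that both passes and fails on the same code revision.

```python
from collections import defaultdict
from statistics import quantiles


def flakiness_rate(runs):
    """Share of tests that both passed and failed on one code revision.

    `runs` is an iterable of (test_name, passed) pairs, all collected from
    repeated executions of the same revision (assumed input format).
    """
    outcomes = defaultdict(set)
    for test_name, passed in runs:
        outcomes[test_name].add(bool(passed))
    if not outcomes:
        return 0.0
    # A test counts as flaky once both outcomes have been observed for it.
    flaky = sum(1 for seen in outcomes.values() if len(seen) == 2)
    return flaky / len(outcomes)


def suggest_timeout(durations, percentile=99, safety_factor=1.5, floor=1.0):
    """Hypothetical heuristic, not the paper's method.

    Takes durations (in seconds) of past passing runs and returns the chosen
    percentile scaled by a safety factor, never going below `floor`.
    """
    if not durations:
        raise ValueError("need at least one observed duration")
    if not 1 <= percentile <= 99:
        raise ValueError("percentile must be between 1 and 99")
    if len(durations) == 1:
        base = durations[0]
    else:
        # quantiles(n=100) returns 99 cut points; index p-1 is percentile p.
        cut_points = quantiles(durations, n=100, method="inclusive")
        base = cut_points[percentile - 1]
    return max(floor, base * safety_factor)
```

Because a test is only counted as flaky after both outcomes have been seen, running each test more often on the same revision can only reveal more flaky tests, which is consistent with the reported rate depending on the number of repeated executions. A percentile-based timeout of this kind trades off the two goals named in the abstract: a larger safety factor means fewer spurious timeouts, while a smaller one detects blocked tests sooner.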
