Taming Timeout Flakiness: An Empirical Study of SAP HANA

From arXiv. 12 pages, 9 figures, 3 tables. Published in the Proceedings of the 46th International Conference on Software Engineering: Software Engineering in Practice (ICSE SEIP 2024).

Regression testing aims to prevent code changes from breaking existing features. Flaky tests negatively affect regression testing because they result in test failures that are not necessarily caused by code changes, thus providing an ambiguous signal. Test timeouts are one contributing factor to such flaky test failures. With the goal of reducing test flakiness in SAP HANA, we empirically study the impact of test timeouts on flakiness in system tests. We evaluate different approaches to automatically adjust timeout values, assessing their suitability for reducing execution time costs and improving build turnaround times. We collect metadata on SAP HANA's test executions by repeatedly executing tests on the same code revision over a period of six months. We analyze the test flakiness rate, investigate the evolution of test timeout values, and evaluate different approaches for optimizing timeout values. The test flakiness rate ranges from 49% to 70%, depending on the number of repeated test executions. Test timeouts account for 70% of flaky test failures. Developers typically react to flaky timeouts by manually increasing timeout values or splitting long-running tests. However, manually adjusting timeout values is a tedious task. Our approach for timeout optimization reduces timeout-related flaky failures by 80% and reduces the overall median timeout value by 25%, i.e., blocked tests are identified faster. Test timeouts are a major contributing factor to flakiness in system tests. It is challenging for developers to effectively mitigate this problem manually. Our technique for optimizing timeout values reduces flaky failures while minimizing test costs. Practitioners working on large-scale industrial software systems can use our findings to increase the effectiveness of their system tests while reducing the burden on developers to manually maintain appropriate timeout values.
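The abstract reports that the measured flakiness rate depends on how many times each test is repeated on the same code revision. The sketch below illustrates why that can happen. It is a minimal example under stated assumptions and not the study's actual tooling: the outcome labels, the helper names (`is_flaky`, `flakiness_rate`, `timeout_share`), the sample data, and the definition of the rate (flaky tests as a share of tests with at least one non-passing run) are all hypothetical, since the abstract does not give the exact definition.

```python
from typing import Dict, List

# Hypothetical outcome labels for one test execution: "pass", "fail", "timeout".
Outcome = str


def is_flaky(outcomes: List[Outcome]) -> bool:
    """Assumed definition: a test is flaky if, on the same code revision,
    it passed at least once and did not pass at least once."""
    return "pass" in outcomes and any(o != "pass" for o in outcomes)


def flakiness_rate(runs: Dict[str, List[Outcome]], k: int) -> float:
    """Share of failing tests that are flaky, using only the first k executions."""
    failing = [
        outcomes[:k]
        for outcomes in runs.values()
        if any(o != "pass" for o in outcomes[:k])
    ]
    if not failing:
        return 0.0
    return sum(is_flaky(o) for o in failing) / len(failing)


def timeout_share(runs: Dict[str, List[Outcome]]) -> float:
    """Share of the non-passing executions of flaky tests that are timeouts."""
    failures = [
        o
        for outcomes in runs.values()
        if is_flaky(outcomes)
        for o in outcomes
        if o != "pass"
    ]
    if not failures:
        return 0.0
    return failures.count("timeout") / len(failures)


# Made-up data: four tests, each executed four times on one revision.
runs = {
    "test_a": ["pass", "pass", "timeout", "pass"],  # flaky, visible only from run 3
    "test_b": ["fail", "fail", "fail", "fail"],     # consistently failing, not flaky
    "test_c": ["fail", "pass", "pass", "pass"],     # flaky, visible from run 2
    "test_d": ["pass", "pass", "pass", "pass"],     # stable
}

print(flakiness_rate(runs, k=2))
print(flakiness_rate(runs, k=4))
print(timeout_share(runs))
```

In this made-up data, `test_a` only shows its flaky behavior on the third execution, so the measured rate is higher with four repetitions than with two. This is one way a rate can vary with the number of repeated executions.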
