The Importance of Discerning Flaky from Fault-triggering Test Failures: A Case Study on the Chromium CI

Flaky tests are tests that pass and fail on different executions of the same version of a program under test. They waste valuable developer time by making developers investigate false alerts (flaky test failures). To deal with this problem, many prediction methods that identify flaky tests have been proposed. While these methods are promising, their actual utility remains unclear since they have not been evaluated within a continuous integration (CI) process. In particular, it remains unclear what the impact of missed faults is at different CI cycles, i.e., the impact of treating fault-triggering test failures as flaky. To fill this gap, we apply state-of-the-art flakiness prediction methods to the Chromium CI and evaluate their performance. Perhaps surprisingly, we find that, despite the high precision (99.2%) of the methods, their application leads to many missed faults, approximately 76.2% of all regression faults. To explain this result, we analyse the fault-triggering failures and show that flaky tests have a strong fault-revealing capability, i.e., they reveal more than 1/3 of all regression faults. This indicates an inherent limitation of all methods that focus on identifying flaky tests instead of flaky test failures. Going a step further, we build failure-focused prediction methods and optimize them by considering new features. Interestingly, we find that these methods perform better than the test-focused ones, with the Matthews correlation coefficient (MCC) increasing from 0.20 to 0.42. Overall, our findings imply that, on the one hand, future research should focus on predicting flaky test failures instead of flaky tests and, on the other, that more thorough experimental methodologies are needed when evaluating flakiness prediction methods.
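To illustrate how high precision and a large share of missed faults can coexist, the following minimal sketch computes precision, the missed-fault rate, and MCC from a confusion matrix in which "flaky failure" is the positive class. The function names and all counts are assumptions made for illustration only; the counts are invented so that the arithmetic mirrors the reported percentages and are not the Chromium data.

```python
import math


def precision(tp: int, fp: int) -> float:
    """Share of failures predicted flaky that really are flaky."""
    return tp / (tp + fp)


def missed_fault_rate(fp: int, tn: int) -> float:
    """Share of fault-triggering failures wrongly predicted flaky.

    With 'flaky' as the positive class, a fault-triggering failure labelled
    flaky is a false positive, and one correctly kept is a true negative.
    """
    return fp / (fp + tn)


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient; 0.0 when a marginal sum is zero."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical, heavily imbalanced counts (NOT taken from the study):
# flaky failures vastly outnumber fault-triggering ones.
tp, fp, fn, tn = 9920, 80, 500, 25

print(f"precision         = {precision(tp, fp):.3f}")
print(f"missed-fault rate = {missed_fault_rate(fp, tn):.3f}")
print(f"MCC               = {mcc(tp, fp, fn, tn):.3f}")
```

Because fault-triggering failures are rare in this made-up example, a handful of false positives barely lowers precision while still accounting for most of the faults. A class-balance-aware metric such as MCC is less prone to hiding this effect.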
