Cross-Project Flakiness: A Case Study of the OpenStack Ecosystem

Automated regression testing is a cornerstone of modern software development, often contributing directly to code review and Continuous Integration (CI). Yet some tests suffer from flakiness: their outcomes vary non-deterministically. Flakiness erodes developer trust in test results, wastes computational resources, and undermines CI reliability. While prior research has examined test flakiness within individual projects, its broader ecosystem-wide impact remains largely unexplored. In this paper, we present an empirical study of test flakiness in the OpenStack ecosystem, focusing on (1) cross-project flakiness, where flaky tests impact multiple projects, and (2) inconsistent flakiness, where a test exhibits flakiness in some projects but remains stable in others. By analyzing 649 OpenStack projects, we identify 1,535 cross-project flaky tests and 1,105 inconsistently flaky tests. We find that cross-project flakiness affects 55% of OpenStack projects and significantly increases both review time and computational costs. Surprisingly, 70% of unit tests exhibit cross-project flakiness, challenging the assumption that unit tests, unlike integration and system-level tests, are inherently insulated from issues that span modules. Through qualitative analysis, we observe that race conditions in CI, inconsistent build configurations, and dependency mismatches are the primary causes of inconsistent flakiness. These findings underline the need for better coordination across complex ecosystems, standardized CI configurations, and improved test isolation strategies.
