Continuous Integration (CI) is widely used to provide rapid feedback on code changes; however, CI build outcomes are not always reliable. Builds may fail intermittently due to non-deterministic factors, leading to flaky builds that undermine developers' trust in CI, waste computational resources, and threaten the validity of CI-related empirical studies. In this paper, we present a large-scale empirical study of flaky builds in GitHub Actions based on rerun data from 1,960 open-source Java projects. Our results show that 3.2% of builds are rerun, and 67.73% of these rerun builds exhibit flaky behavior, affecting 1,055 (51.28%) of the projects. Through an in-depth failure analysis, we identify 15 distinct categories of flaky failures, among which flaky tests, network issues, and dependency resolution issues are the most prevalent. Building on these findings, we propose a machine learning-based approach for detecting flaky failures at the job level. Compared with a state-of-the-art baseline, our approach improves the F1-score by up to 20.3%.