In principle, Continuous Integration (CI) pipeline failures provide valuable feedback to developers on code-related errors. In practice, however, pipeline jobs often fail intermittently due to non-deterministic tests, network outages, infrastructure failures, resource exhaustion, and other reliability issues. These intermittent (flaky) job failures lead to substantial inefficiencies: wasted computational resources from repeated reruns and significant diagnosis time that distracts developers from core activities and often requires intervention from specialized teams. Prior work has proposed machine learning techniques to detect intermittent failures, but does not address the subsequent diagnosis challenge. To fill this gap, we introduce FlaXifyer, a few-shot learning approach for predicting intermittent job failure categories using pre-trained language models. FlaXifyer requires only job execution logs and achieves 84.3% Macro F1 and 92.0% Top-2 accuracy with just 12 labeled examples per category. We also propose LogSift, an interpretability technique that identifies influential log statements in under one second, reducing review effort by 74.4% while surfacing relevant failure information in 87% of cases. Evaluation on 2,458 job failures from TELUS demonstrates that FlaXifyer and LogSift enable effective automated triage, accelerate failure diagnosis, and pave the way towards the automated resolution of intermittent job failures.
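For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how Macro F1 and Top-2 accuracy are conventionally computed. It is not the authors' evaluation code: the function names, the category labels (`network`, `test`, `infra`), and the toy predictions are all hypothetical and chosen only for illustration.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


def top_k_accuracy(y_true, ranked_preds, k=2):
    """Fraction of samples whose true label is among the k highest-ranked predictions."""
    hits = sum(1 for t, ranked in zip(y_true, ranked_preds) if t in ranked[:k])
    return hits / len(y_true)


# Hypothetical failure categories and ranked model outputs (best guess first).
y_true = ["network", "network", "test", "infra"]
ranked = [
    ["network", "infra"],
    ["test", "network"],
    ["test", "infra"],
    ["network", "test"],
]
y_pred = [r[0] for r in ranked]  # Top-1 prediction per job

macro = macro_f1(y_true, y_pred)
top2 = top_k_accuracy(y_true, ranked, k=2)
print(f"Macro F1: {macro:.3f}, Top-2 accuracy: {top2:.3f}")
```

Macro F1 weights every category equally regardless of how many jobs fall into it, which matters when failure categories are imbalanced. Top-2 accuracy credits a prediction whenever the correct category is among the two highest-ranked suggestions.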