A Practical Framework for Flaky Failure Triage in Distributed Database Continuous Integration

Flaky failure triage is crucial for keeping distributed database continuous integration (CI) efficient and reliable. After a failure is observed, operators must quickly decide whether to auto-rerun the job as likely flaky or escalate it as likely persistent, often within millisecond latency budgets on CPU-only hardware. Existing approaches remain difficult to deploy in this setting because they may rely on post-failure artifacts, produce poorly calibrated scores under telemetry and workload shifts, or learn from labels generated by finite rerun policies. To address these challenges, we present SCOUT, a practical state-aware, causal, online, uncertainty-calibrated triage framework for distributed database CI. SCOUT uses only strictly causal features, namely pre-failure telemetry and strictly historical data, to make online decisions without lookahead. Specifically, SCOUT combines lightweight state-aware scoring with optional sparse metadata fusion, applies post-hoc calibration to support fixed-threshold decisions across temporal and cross-domain shifts, and introduces a posterior-soft correction to reduce the label bias induced by finite rerun budgets. We evaluated SCOUT on a benchmark of 3,680 labeled failed runs (462 of them flaky positives) described by 62 telemetry/context features. We further studied the feasibility of SCOUT on TiDB v7/v8 and on a large metadata-only GitHub Actions trace. The experimental results demonstrate its effectiveness and practical usefulness. We deployed SCOUT in a production environment, where it achieves an end-to-end P95 latency of 1.17 ms on CPU.
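
To make the label bias from finite rerun budgets concrete, the following is a minimal Python sketch of one way a soft label could be derived; it is an illustrative assumption, not SCOUT's actual posterior-soft correction, which the abstract does not specify. The function name soft_flaky_label, the prior prior_flaky, and the per-rerun pass probability pass_rate_if_flaky are all hypothetical. The idea is that a run whose reruns all failed is not certainly persistent: a flaky test can fail several times in a row, so Bayes' rule assigns it a small but nonzero flaky probability that shrinks as more reruns fail.

```python
def soft_flaky_label(passed_on_rerun: bool,
                     reruns_failed: int,
                     prior_flaky: float,
                     pass_rate_if_flaky: float) -> float:
    """Return a soft label in [0, 1]: the posterior probability that a failure is flaky.

    Assumptions (hypothetical, for illustration only):
      - a persistent failure never passes on rerun;
      - a flaky failure passes each rerun independently with
        probability pass_rate_if_flaky;
      - prior_flaky is the prior probability that a failure is flaky.
    """
    if passed_on_rerun:
        # Under the assumptions above, only a flaky failure can pass on rerun.
        return 1.0
    # Likelihood that a flaky failure still failed every one of its reruns.
    flaky_and_all_failed = prior_flaky * (1.0 - pass_rate_if_flaky) ** reruns_failed
    # A persistent failure fails every rerun with probability 1.
    persistent_and_all_failed = 1.0 - prior_flaky
    return flaky_and_all_failed / (flaky_and_all_failed + persistent_and_all_failed)


# With a hypothetical prior of 0.125 and a 0.5 pass rate per rerun,
# the soft label decreases as the rerun budget grows.
for budget in (0, 1, 3):
    print(budget, soft_flaky_label(False, budget, 0.125, 0.5))
```

Under these assumptions, a hard "persistent" label obtained from a small rerun budget is replaced by a probability strictly between 0 and 1, which a model can consume as a soft training target instead of a possibly wrong 0.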
