Inferring the root cause of failures among thousands of components in a datacenter network is challenging, especially for "gray" failures that are not reported directly by switches. Faults can be localized through end-to-end measurements, but past localization schemes are either too slow for large-scale networks or sacrifice accuracy. We describe Flock, a network fault localization algorithm and system that achieves both high accuracy and speed at datacenter scale. Flock uses a probabilistic graphical model (PGM) to achieve high accuracy, coupled with new techniques that dramatically accelerate inference in discrete-valued Bayesian PGMs. Large-scale simulations and experiments in a hardware testbed show that Flock speeds up inference by more than 10,000x compared to past PGM methods and improves accuracy over the best previous datacenter fault localization approaches, reducing inference error by 1.19-11x on the same input telemetry and by 1.2-55x after incorporating passive telemetry. We also prove that Flock's inference is optimal in restricted settings.
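To make the idea of localizing faults from end-to-end measurements with a Bayesian model concrete, the following is a minimal toy sketch. It is not Flock's algorithm: the three-link topology, the flow counts, the per-link drop rates, and the names `PRIOR_FAIL`, `DROP_GOOD`, `DROP_BAD`, `log_likelihood`, and `map_hypothesis` are all illustrative assumptions. It scores every subset of possibly failed links and returns the most probable one, which is exactly the kind of exhaustive search that becomes too slow at scale and that faster inference techniques aim to avoid.

```python
import math
from itertools import combinations

# Illustrative parameters (assumptions, not values from the paper).
PRIOR_FAIL = 0.01   # prior probability that any given link has failed
DROP_GOOD = 1e-4    # per-packet drop probability on a healthy link
DROP_BAD = 0.05     # per-packet drop probability on a failed link


def log_likelihood(failed, flows):
    """Log-likelihood of the observed drops given a set of failed links.

    Each flow is (path, packets_sent, packets_dropped). The binomial
    coefficient is omitted because it is the same for every hypothesis.
    """
    total = 0.0
    for path, sent, dropped in flows:
        deliver = 1.0
        for link in path:
            deliver *= 1.0 - (DROP_BAD if link in failed else DROP_GOOD)
        drop = 1.0 - deliver
        total += dropped * math.log(drop) + (sent - dropped) * math.log(deliver)
    return total


def map_hypothesis(links, flows):
    """Brute-force MAP estimate: try all 2^n subsets of failed links."""
    best, best_score = None, -math.inf
    for k in range(len(links) + 1):
        for subset in combinations(links, k):
            failed = frozenset(subset)
            log_prior = (len(failed) * math.log(PRIOR_FAIL)
                         + (len(links) - len(failed)) * math.log(1.0 - PRIOR_FAIL))
            score = log_prior + log_likelihood(failed, flows)
            if score > best_score:
                best, best_score = failed, score
    return best


# Hypothetical telemetry: flows crossing link B lose about 5% of packets.
LINKS = ["A", "B", "C"]
FLOWS = [
    (("A", "B"), 1000, 48),
    (("B", "C"), 1000, 52),
    (("A", "C"), 1000, 0),
]

print(sorted(map_hypothesis(LINKS, FLOWS)))  # ['B']
```

Here the flow over links A and C sees no loss, so the model attributes the drops on the other two flows to their shared link B. The search visits 2^n hypotheses for n links, which is feasible for three links but not for the thousands of components in a datacenter network.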