SprayCheck: Finding Gray Failures in Adaptive Routing Networks

Distributed machine learning (ML) training has become a dominant workload in modern data center networks, operating at massive scale with clusters comprising tens to hundreds of thousands of GPUs. At this scale, failures, and gray failures in particular, are inevitable. Gray failures can significantly degrade both network and application performance, yet they are notoriously difficult to detect, localize, and debug. To meet the performance demands of ML workloads, adaptive routing is widely deployed to maximize network utilization by dynamically spreading traffic across many paths. While adaptive routing increases network utilization, it also greatly intensifies the effect of gray failures. Prior work has either dismissed gray failures as negligible or proposed detection mechanisms that fail to scale, making these approaches increasingly impractical for large-scale clusters. We present SprayCheck, a passive gray failure detection system that leverages the statistical properties of adaptive routing and network load balancing. By combining these properties with flow-level information, SprayCheck can identify failures before they significantly impact application performance, enabling preemptive rerouting and improving overall performance. Importantly, this is achieved through passive observation of traffic spraying, without introducing additional load on the network. We evaluate SprayCheck on Llama-3 70B training in a 64-spine topology and show that it can detect and localize a single-link packet drop rate of $1.5\%$ within a single training iteration, and a rate as low as $0.5\%$ within 5 training iterations.
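The abstract does not specify the detection statistic, so the following is only an illustrative sketch of the general idea that evenly sprayed traffic gives every path a comparable sample, which makes a lossy path stand out statistically. It is not SprayCheck's algorithm. The per-path counters `sent` and `lost`, the function name `flag_lossy_paths`, the smoothing, and the threshold `z_threshold` are all assumptions introduced for this example.

```python
import math

def flag_lossy_paths(sent, lost, z_threshold=4.0):
    """Return indices of paths whose loss count is anomalously high.

    sent[i], lost[i] are hypothetical per-path packet counters gathered
    by passive observation. This is an assumed, simplified test and not
    the method used by SprayCheck.
    """
    total_sent = sum(sent)
    total_lost = sum(lost)
    flagged = []
    for i, (n, k) in enumerate(zip(sent, lost)):
        rest_sent = total_sent - n
        rest_lost = total_lost - k
        if n == 0 or rest_sent == 0:
            continue  # nothing to compare against
        # Baseline loss rate from all *other* paths, with add-one
        # smoothing so the variance is never exactly zero.
        p0 = (rest_lost + 1) / (rest_sent + 2)
        expected = n * p0
        std = math.sqrt(n * p0 * (1.0 - p0))
        # One-sided z-score: only unusually high loss is suspicious.
        z = (k - expected) / std
        if z > z_threshold:
            flagged.append(i)
    return flagged

# Example: 64 paths with 100,000 packets each; path 17 drops 1.5%.
sent = [100_000] * 64
lost = [10] * 64
lost[17] = 1_500
suspects = flag_lossy_paths(sent, lost)
```

In this toy setting only path 17 is flagged, because its loss count is far above what the baseline estimated from the other 63 paths would predict. A real deployment would also have to account for congestion drops and uneven spraying, which this sketch ignores.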
