As training scales grow, collective communication libraries (CCLs) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow or hung communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
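To make the probe/analyzer split concrete, the following is a minimal sketch of how rank-level samples could be turned into a slow/hang verdict. It is not CCL-D's actual algorithm, which the abstract does not specify: the `ProbeSample` fields, the `diagnose` function, and both thresholds are hypothetical names and values chosen for illustration. The assumed heuristic is that a hang is attributed to ranks that never entered the newest collective while their peers have waited past a timeout, and a slowdown is attributed to ranks that arrive at the collective much later than the median rank.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class ProbeSample:
    """Hypothetical per-rank record emitted by a probe (illustrative only)."""
    rank: int
    entered_seq: int    # sequence number of the latest collective this rank entered
    completed_seq: int  # sequence number of the latest collective this rank finished
    enter_time: float   # time (seconds) at which entered_seq was entered


def diagnose(samples, now, hang_timeout_s=60.0, slow_skew_s=1.0):
    """Return (kind, suspect_ranks), where kind is 'ok', 'hang', or 'slow'.

    Both thresholds are assumed values, not taken from the paper.
    """
    newest = max(s.entered_seq for s in samples)
    arrived = [s for s in samples if s.entered_seq == newest]
    # Ranks that have not yet reached the newest collective.
    behind = sorted(s.rank for s in samples if s.entered_seq < newest)

    # Hang: at least one rank has been stuck inside the newest collective
    # for longer than the timeout. Suspects are the ranks that never arrived;
    # an empty list means this heuristic cannot localize the fault.
    waiting = [s for s in arrived if s.completed_seq < newest]
    if any(now - s.enter_time > hang_timeout_s for s in waiting):
        return "hang", behind

    # Slow: among ranks that did arrive, flag those that entered the
    # collective much later than the median arrival time.
    mid = median(s.enter_time for s in arrived)
    late = sorted(s.rank for s in arrived if s.enter_time - mid > slow_skew_s)
    if late:
        return "slow", late
    return "ok", []


if __name__ == "__main__":
    # Rank 2 never entered collective 10 while the others have waited 100 s.
    stuck = [ProbeSample(r, 9 if r == 2 else 10, 9, 100.0) for r in range(4)]
    print(diagnose(stuck, now=200.0))
```

In this sketch the probe only reports sequence numbers and timestamps, so the analyzer can run centrally on cheap, fixed-size records; a real system would likely combine several cross-layer metrics rather than arrival times alone.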