CCL-D: A High-Precision Diagnostic System for Slow and Hang Anomalies in Large-Scale Model Training

Yida Gu, Fakang Wang, Jianhao Fu, Zhenhang Sun, Qianyu Zhang, Hairui Zhao, Xingchen Liu, Yang Tian, Wenjing Huang, Zedong Liu, Yifan Chen, Jinwu Yang, Yueyuan Zhou, Qian Zhao, Haoxu Li, Tao Wang, Feng Yu, Zhan Wang, Guangming Tan, Dingwen Tao

From arXiv. Accepted by PPoPP'26. 13 figures, 2 tables.

As training scales grow, collective communication libraries (CCLs) increasingly face anomalies arising from complex interactions among hardware, software, and environmental factors. These anomalies typically manifest as slow or hung communication, the most frequent and time-consuming category to diagnose. However, traditional diagnostic methods remain inaccurate and inefficient, frequently requiring hours or even days for root-cause analysis. To address this, we propose CCL-D, a high-precision diagnostic system designed to detect and locate slow/hang anomalies in large-scale distributed training. CCL-D integrates a rank-level real-time probe with an intelligent decision analyzer. The probe measures cross-layer anomaly metrics using a lightweight distributed tracing framework to monitor communication traffic. The analyzer performs automated anomaly detection and root-cause localization, precisely identifying the faulty GPU rank. Deployed on a 4,000-GPU cluster over one year, CCL-D achieved near-complete coverage of known slow/hang anomalies and pinpointed affected ranks within 6 minutes, substantially outperforming existing solutions.
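The abstract does not spell out the analyzer's decision logic, so the following is only an illustrative sketch of the general idea of rank-level localization, not CCL-D's actual algorithm. It assumes a hypothetical probe that reports, for each rank, the sequence number of the last collective it entered and the time at which it entered. The names `entered_seq`, `enter_time`, `locate_suspects`, and the `slow_gap_s` threshold are all assumptions made for this example.

```python
from statistics import median


def locate_suspects(entered_seq, enter_time, slow_gap_s=1.0):
    """Classify a snapshot of per-rank probe data (hypothetical format).

    entered_seq: rank -> sequence number of the last collective entered
    enter_time:  rank -> time in seconds at which that collective was entered
    slow_gap_s:  assumed threshold for how far behind the median arrival
                 a rank may be before it is flagged as slow
    """
    # Hang heuristic: ranks that never entered the newest collective are
    # the ones every other rank is waiting on.
    newest = max(entered_seq.values())
    behind = sorted(r for r, seq in entered_seq.items() if seq < newest)
    if behind:
        return "hang", behind

    # Slow heuristic: all ranks reached the same collective, so flag the
    # ranks that arrived much later than the median arrival time.
    mid = median(enter_time.values())
    late = sorted(r for r, t in enter_time.items() if t - mid > slow_gap_s)
    if late:
        return "slow", late

    return "ok", []


# Rank 2 never entered collective 5, so it is the hang suspect.
print(locate_suspects({0: 5, 1: 5, 2: 4, 3: 5},
                      {0: 10.0, 1: 10.1, 2: 9.0, 3: 10.05}))
```

A production system would need to handle multiple communicators, clock skew between hosts, and cross-layer metrics; this sketch ignores all of those.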

