Fail-slows, or stragglers, are common but largely unheeded problems in large-scale hybrid-parallel training that spans thousands of GPU servers and runs for weeks to months. Yet these problems are not well studied, nor can they be quickly detected and effectively mitigated. In this paper, we first present a characterization study on a shared production cluster with over 10,000 GPUs. We find that fail-slows are caused by various CPU/GPU computation and cross-node networking issues, last from tens of seconds to nearly ten hours, and collectively delay the average job completion time by 1.34%. The current practice is to detect these fail-slows manually and simply treat them as fail-stops using a checkpoint-and-restart failover approach, which is labor-intensive and time-consuming. We then propose FALCON, a framework that rapidly identifies fail-slowed GPUs and/or communication links and effectively tackles them with a novel multi-level mitigation mechanism, all without human intervention. We have applied FALCON to detect human-labeled fail-slows in a production cluster with over 99% accuracy. Cluster deployment further demonstrates that FALCON effectively handles manually injected fail-slows, mitigating the training slowdown by 60.1%.