Distributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applications when multiple collectives circularly wait for one another. They pose a significant challenge to the correctness and efficiency of distributed deep learning, and no general, effective solution is currently available. Only in specific scenarios can ad-hoc methods, which make an application invoke collectives in a consistent order across GPUs, be used to prevent circular collective dependencies and deadlocks. This paper presents DFCCL, a novel GPU collective communication library that provides a comprehensive approach to GPU collective deadlock prevention while maintaining high performance. DFCCL achieves preemption for GPU collectives at the bottom library level, effectively preventing deadlocks even when applications introduce circular collective dependencies. DFCCL further ensures high performance through its execution and scheduling methods for collectives. Experiments show that DFCCL effectively prevents GPU collective deadlocks in various situations. Moreover, extensive evaluations demonstrate that DFCCL delivers performance comparable or superior to that of NCCL, the state-of-the-art collective communication library highly optimized for NVIDIA GPUs.
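To make the circular-wait scenario concrete, the following minimal sketch (illustrative only, not taken from the paper or from DFCCL's API) shows how inconsistent collective ordering across ranks can deadlock with a conventional library such as NCCL. It assumes two communicators, commA and commB, already created over the same two ranks, and that each rank's first collective occupies GPU resources while waiting for its peer to join.

```cpp
// Illustrative sketch of a circular collective dependency with NCCL.
// Assumption: commA and commB are ncclComm_t communicators spanning ranks 0 and 1.
#include <nccl.h>
#include <cuda_runtime.h>

void run_rank(int rank, ncclComm_t commA, ncclComm_t commB,
              float* bufA, float* bufB, size_t count, cudaStream_t stream) {
    if (rank == 0) {
        // Rank 0 issues the collective on commA first, then commB.
        ncclAllReduce(bufA, bufA, count, ncclFloat, ncclSum, commA, stream);
        ncclAllReduce(bufB, bufB, count, ncclFloat, ncclSum, commB, stream);
    } else {
        // Rank 1 issues them in the opposite order: commB first, then commA.
        // Each rank's first collective waits for the other rank to join it,
        // while holding resources the other collective needs to start,
        // producing a circular wait, i.e., a GPU collective deadlock.
        ncclAllReduce(bufB, bufB, count, ncclFloat, ncclSum, commB, stream);
        ncclAllReduce(bufA, bufA, count, ncclFloat, ncclSum, commA, stream);
    }
    cudaStreamSynchronize(stream);  // with mismatched ordering, neither rank returns
}
```

The ad-hoc remedy mentioned above amounts to enforcing that both ranks issue the commA and commB collectives in the same order; DFCCL instead removes the hazard at the library level via preemption, so the application-level ordering constraint is no longer required.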