As state-of-the-art neural networks scale to billions of parameters, designing parallel algorithms that can train these networks efficiently on multi-GPU clusters has become critical. This paper presents Tensor3D, a novel three-dimensional (3D) approach to parallelizing tensor computations that strives to minimize the idle time incurred due to communication in the parallel training of large multi-billion-parameter models. First, we introduce an intelligent distribution of neural network parameters across GPUs that eliminates the communication required to satisfy the data dependencies of individual layers. Then, we propose a novel overdecomposition of the parallel training process, with which we achieve significant overlap of communication with computation, thereby reducing GPU idle time. Finally, we present a communication model that helps users identify communication-optimal decompositions of the available hardware resources for a given neural network. For a 28B-parameter CNN on 256 A100 GPUs, Tensor3D improves training time by nearly 60% compared to Megatron-LM.