SWARM Parallelism: Training Large Models Can Be Surprisingly Communication-Efficient

Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with less than 200Mb/s network.
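One way to see why larger models can become less communication-intensive under pipeline-style model parallelism is a back-of-the-envelope estimate. The sketch below is illustrative and is not taken from the paper. It assumes a standard Transformer block with roughly 12·h² weights (attention projections plus a 4x feed-forward expansion), about 2 FLOPs per weight per token in the forward pass, and 16-bit activations sent between pipeline stages. It ignores the attention-score computation and the backward pass. All function names are hypothetical.

```python
# Rough estimate: compute per token grows with hidden_size**2, while the
# activations passed between pipeline stages grow only with hidden_size.
# All constants below are simplifying assumptions, not figures from the paper.


def layer_flops_per_token(hidden_size: int) -> int:
    """Approximate forward FLOPs per token for one Transformer block."""
    weights = 12 * hidden_size ** 2  # 4h^2 (attention) + 8h^2 (feed-forward)
    return 2 * weights               # one multiply and one add per weight


def boundary_bytes_per_token(hidden_size: int, bytes_per_value: int = 2) -> int:
    """Bytes sent per token across one pipeline-stage boundary."""
    return hidden_size * bytes_per_value


def flops_per_byte(hidden_size: int, layers_per_stage: int = 1,
                   bytes_per_value: int = 2) -> float:
    """Compute performed per byte communicated; higher is more efficient."""
    compute = layers_per_stage * layer_flops_per_token(hidden_size)
    return compute / boundary_bytes_per_token(hidden_size, bytes_per_value)


if __name__ == "__main__":
    for hidden_size in (1024, 2048, 4096, 8192):
        print(hidden_size, flops_per_byte(hidden_size))
```

Under these assumptions, doubling the hidden size doubles the compute done per byte communicated, so a slow network becomes less of a bottleneck as the model widens.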
