Many deep learning applications benefit from using large models with billions of parameters. Training these models is notoriously expensive due to the need for specialized HPC clusters. In this work, we consider alternative setups for training large models: using cheap "preemptible" instances or pooling existing resources from multiple regions. We analyze the performance of existing model-parallel algorithms in these conditions and find configurations where training larger models becomes less communication-intensive. Based on these findings, we propose SWARM parallelism, a model-parallel training algorithm designed for poorly connected, heterogeneous and unreliable devices. SWARM creates temporary randomized pipelines between nodes that are rebalanced in case of failure. We empirically validate our findings and compare SWARM parallelism with existing large-scale training approaches. Finally, we combine our insights with compression strategies to train a large Transformer language model with 1B shared parameters (approximately 13B before sharing) on preemptible T4 GPUs with network bandwidth below 200 Mb/s.
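The claim that larger models can become less communication-intensive is easier to see with a back-of-the-envelope calculation. The sketch below is illustrative and not taken from the paper: it assumes a pipeline stage holding one Transformer-style layer with roughly 12·d² parameters, 2 FLOPs per parameter per token in the forward pass, and one d-dimensional activation vector per token sent to the next stage in 16-bit precision. The function names and constants are hypothetical.

```python
def pipeline_stage_costs(hidden_size, tokens_per_batch, bytes_per_value=2):
    """Rough forward-pass costs for one pipeline stage (illustrative only)."""
    # Assumption: ~12 * d^2 parameters per layer, 2 FLOPs per parameter per token.
    flops = 2 * 12 * hidden_size**2 * tokens_per_batch
    # Assumption: the stage sends one activation vector of size d per token.
    bytes_sent = hidden_size * tokens_per_batch * bytes_per_value
    return flops, bytes_sent


def flops_per_byte(hidden_size, tokens_per_batch=1024):
    """Compute performed per byte communicated; higher means less network-bound."""
    flops, bytes_sent = pipeline_stage_costs(hidden_size, tokens_per_batch)
    return flops / bytes_sent


if __name__ == "__main__":
    # Print the ratio for a few hypothetical hidden sizes.
    for d in (1024, 2048, 4096, 8192):
        print(f"hidden_size={d}: {flops_per_byte(d):.0f} FLOPs per byte sent")
```

Under these assumptions the ratio simplifies to 12 × hidden_size, so doubling the hidden size doubles the compute available to hide each byte of communication. Real ratios depend on the architecture, the backward pass, and any activation compression.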