Elixir: Train a Large Language Model on a Small GPU Cluster

In recent years, the parameter count of deep learning (DL) models has grown much faster than GPU memory capacity. Practitioners without access to a large number of GPUs resort to heterogeneous training systems that store model parameters in CPU memory. Existing heterogeneous systems build their parallelization plans at the scope of the whole model: they apply one consistent parallel training method to every operator in the computation. As a result, engineers must expend considerable effort to incorporate a new type of model parallelism and to make it compatible with the other forms of parallelism. For example, Mixture-of-Experts (MoE) is still incompatible with ZeRO-3 in DeepSpeed. Current systems also suffer from efficiency problems at small scale, since they are designed and tuned for large-scale training. In this paper, we propose Elixir, a new parallel heterogeneous training system designed for efficiency and flexibility. Elixir utilizes the memory and computing resources of both GPU and CPU. For flexibility, Elixir generates parallelization plans at operator granularity, so any new type of model parallelism can be incorporated by assigning a parallel pattern to the operator. For efficiency, Elixir implements a hierarchical distributed memory management scheme that accelerates inter-GPU communication and CPU-GPU data transfers. As a result, Elixir can train a 30B OPT model on a single A100 with 40GB of CUDA memory while reaching 84% of the efficiency of PyTorch GPU training. Thanks to its super-linear scalability, its training efficiency matches that of PyTorch GPU training on multiple GPUs. In addition, large MoE models can be trained 5.3x faster than dense models of the same size. Elixir is now integrated into ColossalAI and is available on its main branch.
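To see why a 30B-parameter model cannot be held on a 40GB GPU without offloading, a rough estimate of the model-state size helps. The sketch below is illustrative only: it assumes mixed-precision training with Adam (16 bytes per parameter in total), which is a common accounting but is not stated in the abstract. It ignores activations and temporary buffers, and the function names are hypothetical.

```python
# Back-of-the-envelope estimate of model-state memory.
# ASSUMPTION: mixed-precision Adam training; these byte counts are a common
# accounting and are not taken from the abstract above.
BYTES_PER_PARAM = {
    "fp16_params": 2,
    "fp16_grads": 2,
    "fp32_master_params": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def model_state_gb(num_params: float) -> dict:
    """Return the size of each model-state component in GB (1 GB = 1e9 bytes)."""
    sizes = {
        name: num_params * nbytes / 1e9
        for name, nbytes in BYTES_PER_PARAM.items()
    }
    sizes["total"] = sum(sizes.values())
    return sizes


def fits_on_gpu(num_params: float, gpu_memory_gb: float) -> bool:
    """True if all model states fit in GPU memory (activations ignored)."""
    return model_state_gb(num_params)["total"] <= gpu_memory_gb


if __name__ == "__main__":
    sizes = model_state_gb(30e9)  # 30B parameters, as for the OPT model above
    for name, gb in sizes.items():
        print(f"{name:>20}: {gb:7.1f} GB")
    print("fits on a 40 GB GPU:", fits_on_gpu(30e9, 40))
```

Under these assumptions the fp16 parameters alone take about 60 GB and the full model state about 480 GB, so most of it has to live in CPU memory, which is the setting heterogeneous training systems target.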
