Elixir: Train a Large Language Model on a Small GPU Cluster

In recent years, large language models have achieved great success due to their unprecedented size. However, training these models poses a challenge for most researchers because it requires a substantial number of GPUs. To reduce GPU memory usage, memory partitioning and memory offloading have been proposed. These approaches, respectively, eliminate memory redundancies and offload memory usage to CPU memory and NVMe storage, enabling training on small GPU clusters. However, directly deploying these solutions often leads to suboptimal efficiency: only experienced experts can unleash the full potential of the hardware by carefully tuning the distributed configuration. We therefore present a novel solution, Elixir, which automates efficient large-model training based on pre-runtime model profiling. Elixir aims to identify the optimal combination of partitioning and offloading techniques to maximize training throughput. In our experiments, Elixir significantly outperforms the current state-of-the-art baseline. Our optimal configuration achieves up to a 3.4$\times$ speedup on GPT-2 models compared with SOTA solutions. We hope that our work will benefit individuals who lack computing resources and expertise, granting them access to large models. The beta version of Elixir is now available at https://github.com/hpcaitech/ColossalAI/tree/feature/elixir.
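To make the memory argument concrete, the following is a minimal back-of-the-envelope sketch of how partitioning and offloading reduce the per-GPU footprint of model states. It is not Elixir's profiler or search procedure, which the abstract does not describe. The function name `per_gpu_state_bytes`, the `offload_fraction` parameter, and the figure of 16 bytes per parameter (fp16 parameters and gradients plus fp32 master weights and two Adam moments, as commonly assumed for mixed-precision Adam training) are assumptions introduced here for illustration; activations and temporary buffers are ignored.

```python
# Illustrative estimate only; all names and constants are assumptions,
# not taken from Elixir or the ColossalAI codebase.

# Assumed mixed-precision Adam layout, in bytes per parameter:
# fp16 param (2) + fp16 grad (2) + fp32 master param (4)
# + fp32 Adam momentum (4) + fp32 Adam variance (4) = 16
BYTES_PER_PARAM = 16


def per_gpu_state_bytes(num_params, num_gpus, offload_fraction=0.0):
    """Estimate model-state bytes resident on each GPU.

    num_params:       total number of model parameters
    num_gpus:         GPUs across which model states are evenly partitioned
    offload_fraction: share of each GPU's shard moved to CPU/NVMe (0.0 to 1.0)
    """
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0.0 and 1.0")

    total = num_params * BYTES_PER_PARAM   # unpartitioned footprint
    shard = total / num_gpus               # after partitioning
    return shard * (1.0 - offload_fraction)  # after offloading


if __name__ == "__main__":
    params = 1_500_000_000  # roughly the size of the largest GPT-2 model
    for gpus, frac in [(1, 0.0), (4, 0.0), (4, 0.5)]:
        gb = per_gpu_state_bytes(params, gpus, frac) / 1e9
        print(f"gpus={gpus} offload={frac:.1f} -> {gb:.1f} GB per GPU")
```

Under these assumptions, partitioning divides the footprint by the number of GPUs and offloading removes a further share from GPU memory, at the cost of extra communication and host-device transfers. That cost is why the choice of configuration affects throughput and is worth automating.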

