In recent years, large language models have achieved great success owing to their unprecedented size. However, training these models is a challenge for most researchers because it requires a substantial number of GPUs. To reduce GPU memory usage, memory partitioning and memory offloading have been proposed. These approaches eliminate memory redundancies and offload memory usage to CPU and NVMe memory, respectively, enabling training on small GPU clusters. However, directly deploying these solutions often leads to suboptimal efficiency: only experienced experts can unleash the full potential of the hardware by carefully tuning the distributed configuration. We therefore present Elixir, a novel solution that automates efficient large-model training based on pre-runtime model profiling. Elixir aims to identify the optimal combination of partitioning and offloading techniques to maximize training throughput. In our experiments, Elixir significantly outperforms the current state-of-the-art (SOTA) baseline, and our optimal configuration achieves up to a 3.4$\times$ speedup on GPT-2 models compared with SOTA solutions. We hope that our work will benefit individuals who lack computing resources and expertise by giving them access to large models. The beta version of Elixir is now available at https://github.com/hpcaitech/ColossalAI/tree/feature/elixir.