Training large language models (LLMs) is constrained by GPU memory consumption due to the substantial footprint of model states. The widely used Zero Redundancy Optimizer (ZeRO) addresses this issue through strategic sharding but introduces communication overhead at scale. To tackle this problem, we propose AMSP, a system designed to optimize ZeRO for scalable LLM training. AMSP offers three flexible sharding strategies, Full-Replica, Full-Sharding, and Partial-Sharding, and allows each component of the model states (parameters, gradients, and optimizer states) to independently choose both a sharding strategy and a device mesh. We conduct a thorough analysis of communication costs and formulate an optimization problem to determine the optimal sharding strategy. Additionally, AMSP accelerates distributed LLM training by efficiently overlapping communication with computation. Evaluations demonstrate up to 52\% Model FLOPs Utilization (MFU) when training a LLaMA-based model on 1024 GPUs, a $1.56\times$ improvement in training throughput over recently proposed systems such as MiCS and ZeRO++.
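The per-component search space described above can be sketched as follows. This is an illustrative enumeration only, assuming the three named strategies and three model-state components; the class and function names are hypothetical and do not correspond to AMSP's actual implementation, and the real system additionally searches over device meshes and scores candidates by communication cost.

```python
from dataclasses import dataclass
from itertools import product

# The three sharding strategies named in the abstract.
STRATEGIES = ("Full-Replica", "Full-Sharding", "Partial-Sharding")


@dataclass(frozen=True)
class ShardingPlan:
    """Hypothetical container: one strategy per model-state component."""
    parameters: str
    gradients: str
    optimizer_states: str


def enumerate_plans():
    """Enumerate every per-component strategy combination (3^3 = 27).

    AMSP's optimization problem would then pick the plan (plus a device
    mesh per component) minimizing modeled communication cost.
    """
    return [ShardingPlan(p, g, o) for p, g, o in product(STRATEGIES, repeat=3)]


plans = enumerate_plans()
print(len(plans))  # 27 candidate plans, before device-mesh choices
```

Even this simplified space shows why an analytical cost model matters: the number of candidates grows multiplicatively once each component's device mesh is also a free choice.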