Large Language Models (LLMs) have demonstrated impressive performance across various downstream tasks. When training these models, there is a growing trend toward processing more tokens at larger training scales with relatively smaller model sizes. The Zero Redundancy Optimizer (ZeRO), although effective in conventional training environments, faces scaling challenges under this emerging paradigm. To address this, we propose AMSP, a novel LLM training framework that partitions model states at a fine granularity, covering parameters ($P$), gradients ($G$), and optimizer states ($OS$). Specifically, AMSP (1) builds a unified partitioning space that enables independent partitioning strategies for $P$, $G$, and $OS$; (2) incorporates a scale-aware partitioner that automatically searches for optimal partitioning strategies; and (3) designs a dedicated communication optimizer to efficiently handle the data placement discrepancies arising from diverse partitioning strategies. Our evaluations show that AMSP achieves up to 90.3% scaling efficiency across 1024 GPUs.
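To make the idea of a unified partitioning space more concrete, the following is a minimal sketch of how per-GPU memory for model states changes when $P$, $G$, and $OS$ are each given their own shard degree. It is an illustration under stated assumptions, not AMSP's implementation: the byte counts assume mixed-precision Adam (2 bytes per parameter for $P$, 2 for $G$, 12 for $OS$, the accounting used in the ZeRO analysis), the names `Strategy`, `per_gpu_bytes`, `candidate_strategies`, and `feasible` are hypothetical, and communication cost, which a scale-aware partitioner would also have to weigh, is not modeled.

```python
from dataclasses import dataclass
from typing import List

# Assumed mixed-precision Adam accounting, in bytes per model parameter.
BYTES_P = 2    # fp16 parameters
BYTES_G = 2    # fp16 gradients
BYTES_OS = 12  # fp32 master weights, momentum, and variance


@dataclass(frozen=True)
class Strategy:
    """Hypothetical point in the partitioning space: one shard degree per state."""
    shard_p: int   # number of GPUs each parameter copy is split across
    shard_g: int   # number of GPUs each gradient copy is split across
    shard_os: int  # number of GPUs each optimizer-state copy is split across


def per_gpu_bytes(num_params: int, s: Strategy) -> float:
    """Model-state memory held by one GPU under strategy s."""
    per_param = BYTES_P / s.shard_p + BYTES_G / s.shard_g + BYTES_OS / s.shard_os
    return num_params * per_param


def candidate_strategies(world_size: int) -> List[Strategy]:
    """Enumerate every combination of shard degrees that divides world_size."""
    degrees = [d for d in range(1, world_size + 1) if world_size % d == 0]
    return [Strategy(p, g, o) for p in degrees for g in degrees for o in degrees]


def feasible(num_params: int, world_size: int, budget_bytes: float) -> List[Strategy]:
    """Keep only the strategies whose model states fit in the per-GPU budget."""
    return [
        s for s in candidate_strategies(world_size)
        if per_gpu_bytes(num_params, s) <= budget_bytes
    ]


if __name__ == "__main__":
    n = 1_000_000_000  # a 1B-parameter model, chosen only for illustration
    for s in (Strategy(1, 1, 1), Strategy(1, 8, 8), Strategy(8, 8, 8)):
        print(s, f"{per_gpu_bytes(n, s) / 1e9:.2f} GB")
```

In this sketch, fully replicating all three states and fully sharding all three are just two corners of the space; intermediate points such as `Strategy(1, 8, 8)` replicate parameters while sharding gradients and optimizer states, trading memory for less parameter communication.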