Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and applying a range of compute and memory optimizations. When combined, many of these strategies interact in complex ways that affect final training efficiency. Prior work on this problem did not have access to the latest optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models and distill it into several key recommendations for the most efficient training. For instance, we find that a micro-batch size of 1 usually enables the most efficient training layouts: larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism, and they also lead to larger pipeline bubbles. Our most efficient configurations achieve state-of-the-art training efficiency across a range of model sizes, most notably a Model FLOPs Utilization (MFU) of 70.5% when training a 13B model.
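The link between micro-batch size, pipeline bubbles, and Model FLOPs Utilization can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the method used in this work: it assumes the commonly used bubble estimate (p - 1) / (m + p - 1) for a non-interleaved pipeline schedule with p stages and m micro-batches, and the rough approximation of 6 FLOPs per parameter per token for a combined forward and backward pass (attention FLOPs are ignored). All function names and numeric inputs are hypothetical.

```python
def num_micro_batches(per_replica_batch: int, micro_batch_size: int) -> int:
    """Number of micro-batches each pipeline replica processes per step."""
    if per_replica_batch % micro_batch_size != 0:
        raise ValueError("per_replica_batch must be divisible by micro_batch_size")
    return per_replica_batch // micro_batch_size


def pipeline_bubble_fraction(pipeline_stages: int, micro_batches: int) -> float:
    """Idle fraction of a step, assuming the estimate (p - 1) / (m + p - 1)."""
    p, m = pipeline_stages, micro_batches
    return (p - 1) / (m + p - 1)


def model_flops_utilization(
    n_params: float,
    tokens_per_second: float,
    num_accelerators: int,
    peak_flops_per_accelerator: float,
) -> float:
    """MFU = achieved model FLOPs/s divided by aggregate peak FLOPs/s.

    Assumes roughly 6 * n_params FLOPs per token (forward + backward),
    which neglects the attention term.
    """
    achieved_flops_per_second = 6.0 * n_params * tokens_per_second
    peak_flops_per_second = num_accelerators * peak_flops_per_accelerator
    return achieved_flops_per_second / peak_flops_per_second


if __name__ == "__main__":
    # Hypothetical setup: 4 pipeline stages, 32 samples per replica per step.
    for mbs in (1, 2, 4, 8):
        m = num_micro_batches(32, mbs)
        bubble = pipeline_bubble_fraction(4, m)
        print(f"micro-batch size {mbs}: {m} micro-batches, bubble {bubble:.1%}")

    # Hypothetical throughput and peak figures, chosen only for illustration.
    mfu = model_flops_utilization(13e9, 128_000, 64, 312e12)
    print(f"MFU: {mfu:.1%}")
```

With the per-replica batch held fixed, raising the micro-batch size reduces the number of micro-batches and so increases the estimated bubble, which is consistent with the observation above that larger micro-batch sizes lead to larger pipeline bubbles.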