Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and applying a range of compute and memory optimizations. When combined, many of these strategies interact in complex ways that affect the final training efficiency. Prior work on this problem did not have access to the latest optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models and distill it into several key recommendations for the most efficient training. For instance, we find that a micro-batch size of 1 usually enables the most efficient training layouts: larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations achieve state-of-the-art training efficiency across a range of model sizes, most notably a Model FLOPs Utilization (MFU) of 70.5% when training a Llama 13B model.
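The link between micro-batch size and pipeline bubbles, and the meaning of the MFU figure, can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the paper. It assumes a non-interleaved pipeline schedule, for which the commonly used idle-fraction estimate is (p - 1) / (m + p - 1) with p pipeline stages and m micro-batches per step, and it approximates model FLOPs as 6 FLOPs per parameter per token, ignoring the attention term. All function names and all numbers in the example (global batch size, parallel degrees, throughput, peak FLOPs) are hypothetical.

```python
def micro_batches_per_step(global_batch: int, micro_batch: int, data_parallel: int) -> int:
    """Number of micro-batches each pipeline processes per optimizer step."""
    count, remainder = divmod(global_batch, micro_batch * data_parallel)
    if remainder != 0:
        raise ValueError("global batch must be divisible by micro_batch * data_parallel")
    return count


def pipeline_bubble_fraction(pipeline_stages: int, num_micro_batches: int) -> float:
    """Idle fraction of a non-interleaved pipeline schedule (assumed model)."""
    return (pipeline_stages - 1) / (num_micro_batches + pipeline_stages - 1)


def model_flops_utilization(
    n_params: float,
    tokens_per_second: float,
    num_accelerators: int,
    peak_flops_per_accelerator: float,
) -> float:
    """Achieved model FLOPs per second divided by aggregate peak FLOPs per second.

    Uses the 6 * N FLOPs-per-token approximation (forward plus backward),
    which ignores attention FLOPs; the paper's exact formula may differ.
    """
    achieved = 6.0 * n_params * tokens_per_second
    return achieved / (num_accelerators * peak_flops_per_accelerator)


if __name__ == "__main__":
    # Hypothetical layout: global batch 2048, 8-way data parallel, 4 pipeline stages.
    for micro_batch in (1, 2, 4):
        m = micro_batches_per_step(2048, micro_batch, data_parallel=8)
        bubble = pipeline_bubble_fraction(4, m)
        print(f"micro-batch {micro_batch}: {m} micro-batches, bubble {bubble:.2%}")

    # Hypothetical throughput on 64 accelerators with 312 TFLOP/s peak each.
    mfu = model_flops_utilization(13e9, 1e5, 64, 312e12)
    print(f"MFU: {mfu:.1%}")
```

With the global batch size held fixed, raising the micro-batch size lowers the number of micro-batches per step, so the bubble fraction grows. This matches the direction of the effect described in the abstract, though the magnitudes here depend entirely on the assumed layout.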