Efficient Parallelization Layouts for Large-Scale Distributed Model Training

From arXiv. Camera-ready version for the Workshop on Advancing Neural Network Training at the 37th Conference on Neural Information Processing Systems (WANT@NeurIPS 2023).

Efficiently training large language models requires parallelizing across hundreds of hardware accelerators and invoking various compute and memory optimizations. When combined, many of these strategies interact in complex ways that affect the final training efficiency. Prior work tackling this problem did not have access to the latest set of optimizations, such as FlashAttention or sequence parallelism. In this work, we conduct a comprehensive ablation study of possible training configurations for large language models. We distill this large study into several key recommendations for the most efficient training. For instance, we find that using a micro-batch size of 1 usually enables the most efficient training layouts. Larger micro-batch sizes necessitate activation checkpointing or higher degrees of model parallelism and also lead to larger pipeline bubbles. Our most efficient configurations enable us to achieve state-of-the-art training efficiency results over a range of model sizes, most notably a Model FLOPs Utilization (MFU) of 70.5% when training a Llama 13B model.
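The link between micro-batch size and pipeline bubbles, and the MFU metric quoted above, can be illustrated with a rough sketch. It assumes a non-interleaved pipeline schedule (GPipe or 1F1B style), for which the idle fraction is commonly estimated as (p - 1) / (m + p - 1) with p pipeline stages and m micro-batches per step, and it assumes the common approximation of 6 FLOPs per parameter per token for a forward and backward pass, which ignores attention FLOPs. The function names and the example numbers are hypothetical and are not taken from the paper.

```python
def micro_batches_per_step(global_batch_size: int,
                           micro_batch_size: int,
                           data_parallel_size: int) -> int:
    """Number of micro-batches each pipeline processes per optimizer step."""
    per_replica, remainder = divmod(global_batch_size, data_parallel_size)
    if remainder != 0 or per_replica % micro_batch_size != 0:
        raise ValueError("global batch must divide evenly into micro-batches")
    return per_replica // micro_batch_size


def bubble_fraction(pipeline_stages: int, num_micro_batches: int) -> float:
    """Estimated idle fraction of a non-interleaved pipeline schedule.

    Assumption: every stage takes equal time per micro-batch.
    """
    if pipeline_stages < 1 or num_micro_batches < 1:
        raise ValueError("stages and micro-batches must be positive")
    return (pipeline_stages - 1) / (num_micro_batches + pipeline_stages - 1)


def model_flops_utilization(tokens_per_second: float,
                            num_parameters: float,
                            num_devices: int,
                            peak_flops_per_device: float) -> float:
    """Approximate MFU using 6 * N FLOPs per token (attention FLOPs ignored)."""
    achieved_flops = 6.0 * num_parameters * tokens_per_second
    return achieved_flops / (num_devices * peak_flops_per_device)


if __name__ == "__main__":
    # Hypothetical layout: global batch 2048, 64 data-parallel replicas, 4 stages.
    for mbs in (1, 2, 4):
        m = micro_batches_per_step(2048, mbs, 64)
        print(f"micro-batch size {mbs}: {m} micro-batches, "
              f"bubble fraction {bubble_fraction(4, m):.3f}")
```

With the global batch size held fixed, a larger micro-batch size leaves fewer micro-batches per step, so the estimated bubble fraction grows. This matches the qualitative claim in the abstract, though the real schedule and timings may differ from this simplified model.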
