In this paper, we explore FP8 low-bit data formats for efficient training of large language models (LLMs). Our key insight is that most variables in LLM training, such as gradients and optimizer states, can use low-precision data formats without compromising model accuracy or requiring changes to hyper-parameters. Specifically, we propose a new FP8 automatic mixed-precision framework for training LLMs. The framework offers three levels of FP8 utilization to streamline mixed-precision and distributed parallel training, incrementally incorporating 8-bit gradients, optimizer states, and distributed learning. Experimental results show that, when training a GPT-175B model on the H100 GPU platform, our FP8 mixed-precision training framework not only reduced real memory usage by 42% but also ran 64% faster than the widely adopted BF16 framework (i.e., Megatron-LM), surpassing the speed of Nvidia Transformer Engine by 17%. This substantially reduces the training cost of large foundation models. Furthermore, our FP8 mixed-precision training methodology is generic: it can be applied seamlessly to other tasks such as LLM instruction tuning and reinforcement learning with human feedback, reducing fine-tuning expenses. Our FP8 low-precision training framework is open-sourced at aka.ms/MS.AMP (https://github.com/Azure/MS-AMP).