Training large language models (LLMs) directly in low precision offers a way to reduce computational costs by improving both throughput and energy efficiency. To this end, NVIDIA's recent Blackwell architecture provides hardware support for very low-precision operations via FP4 variants. Yet current algorithms for training LLMs in FP4 precision suffer significant accuracy degradation and often rely on mixed-precision fallbacks. In this paper, we investigate hardware-supported FP4 training and introduce a new approach for accurate, end-to-end FP4 training that keeps all the major computations (i.e., linear layers) in low precision. Through extensive evaluations on Llama-type models, we reveal a new low-precision scaling law that quantifies performance trade-offs across bit-widths and training setups. Guided by this analysis, we design an "optimal" technique in terms of accuracy-vs-computation, called Quartet. We implement Quartet using optimized CUDA kernels tailored for Blackwell, demonstrating that fully FP4-based training is a competitive alternative to FP16 and FP8 training. Our code is available at https://github.com/IST-DASLab/Quartet.
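As background for the FP4 formats mentioned above: the E2M1 variant used on Blackwell can represent only the magnitudes {0, 0.5, 1, 1.5, 2, 3, 4, 6} plus a sign, so training schemes pair it with per-group scale factors. The following NumPy sketch simulates per-group scaled round-to-nearest FP4 (E2M1) quantization; it is an illustrative toy, not the paper's Quartet algorithm, and the group size of 16 is an arbitrary choice for the example.

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (sign stored separately).
FP4_E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4(x, group_size=16):
    """Simulate FP4 (E2M1) quantization with one scale per group of values.

    Each group is scaled so its largest magnitude maps to 6.0 (the largest
    FP4 value), rounded to the nearest representable number, then rescaled.
    The input length must be divisible by group_size in this toy version.
    """
    x = np.asarray(x, dtype=np.float64)
    groups = x.reshape(-1, group_size)
    out = np.empty_like(groups)
    for i, g in enumerate(groups):
        m = np.abs(g).max()
        scale = m / FP4_E2M1[-1] if m > 0 else 1.0
        # Nearest representable magnitude after scaling; restore sign at the end.
        idx = np.argmin(np.abs(np.abs(g)[:, None] / scale - FP4_E2M1[None, :]),
                        axis=1)
        out[i] = np.sign(g) * FP4_E2M1[idx] * scale
    return out.reshape(x.shape)
```

Values that already lie on the (scaled) FP4 grid pass through unchanged, while everything else is rounded to one of the eight magnitudes per sign; this coarse grid is the source of the accuracy degradation the abstract refers to.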