The growing computational demands of training large language models (LLMs) necessitate more efficient methods. Quantized training presents a promising solution by enabling low-bit arithmetic operations to reduce these costs. While FP8 precision has demonstrated feasibility, leveraging FP4 remains a challenge due to significant quantization errors and limited representational capacity. This work introduces the first FP4 training framework for LLMs, addressing these challenges with two key innovations: a differentiable quantization estimator for precise weight updates and an outlier clamping and compensation strategy to prevent activation collapse. To ensure stability, the framework integrates a mixed-precision training scheme and vector-wise quantization. Experimental results demonstrate that our FP4 framework achieves accuracy comparable to BF16 and FP8, with minimal degradation, scaling effectively to 13B-parameter LLMs trained on up to 100B tokens. With the emergence of next-generation hardware supporting FP4, our framework sets a foundation for efficient ultra-low precision training.
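To make the abstract's two ingredients concrete, the sketch below simulates vector-wise FP4 quantization with outlier clamping and a compensation residual. This is a minimal NumPy illustration, not the paper's implementation: the function name `fp4_quantize`, the `clamp_quantile` parameter, and the per-row quantile clamping rule are assumptions; the value grid follows the standard FP4 E2M1 format (±{0, 0.5, 1, 1.5, 2, 3, 4, 6}).

```python
import numpy as np

# Representable magnitudes of the FP4 E2M1 format (sign handled separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def fp4_quantize(x, clamp_quantile=0.999):
    """Simulated vector-wise FP4 quantize-dequantize with outlier clamping.

    x: 2-D array; each row gets its own scale (vector-wise quantization).
    Returns the dequantized tensor and the clamped-outlier residual, which a
    compensation step could re-add in higher precision. (Illustrative sketch,
    not the paper's actual algorithm.)
    """
    x = np.asarray(x, dtype=np.float64)
    # 1. Clamp outliers to a per-row magnitude threshold at the given quantile.
    thresh = np.quantile(np.abs(x), clamp_quantile, axis=1, keepdims=True)
    clamped = np.clip(x, -thresh, thresh)
    residual = x - clamped  # sparse residual kept for outlier compensation
    # 2. Vector-wise scaling: the row's max magnitude maps to the largest
    #    representable FP4 value (6.0).
    scale = np.maximum(np.abs(clamped).max(axis=1, keepdims=True), 1e-12)
    scale = scale / FP4_GRID[-1]
    scaled = clamped / scale
    # 3. Round each element to the nearest representable FP4 magnitude.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    deq = np.sign(scaled) * FP4_GRID[idx] * scale
    return deq, residual
```

With `clamp_quantile=1.0` nothing is clamped and only the 4-bit rounding error remains; with a lower quantile, large outliers are clamped before scaling so the remaining values use the FP4 grid more finely, and the returned residual lets a compensation step restore the clipped mass.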