The increasing scale of Transformer models has sharply increased their pre-training computational requirements. While quantization has proven effective after pre-training and during fine-tuning, applying quantization to Transformers during pre-training remains largely unexplored at scale for language modeling. This study examines the impact of quantization on efficient pre-training of Transformers, with a focus on linear layer components. By systematically applying straightforward linear quantization to weights, activations, gradients, and optimizer states, we assess its effects on model efficiency, stability, and performance during training. By offering a comprehensive recipe of effective quantization strategies for Transformer pre-training, we enable high training efficiency from scratch while retaining language modeling ability. Code is available at https://github.com/chandar-lab/EfficientLLMs.
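The "straightforward linear quantization" mentioned in the abstract can be sketched as follows. This is a minimal symmetric per-tensor variant in NumPy, for illustration only; the function names and the choice of symmetric scaling are assumptions, not the paper's actual implementation:

```python
import numpy as np

def quantize(x, num_bits=8):
    """Symmetric linear quantization: map values in [-max|x|, max|x|]
    onto a signed integer grid (illustrative sketch, not the paper's code)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q.astype(np.int8), scale

def dequantize(q, scale):
    """Recover an approximation of the original tensor from integers."""
    return q.astype(np.float32) * scale

# Round-trip a small tensor through 8-bit quantization.
x = np.array([0.5, -1.2, 0.03, 2.0], dtype=np.float32)
q, s = quantize(x)
x_hat = dequantize(q, s)
```

In a pre-training setting, such a transform would be applied independently to weights, activations, gradients, and optimizer-state tensors; the rounding error per element is bounded by half the scale.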