Low precision training and inference affect both the quality and cost of language models, but current scaling laws do not account for this. In this work, we devise "precision-aware" scaling laws for both training and inference. We propose that training in lower precision reduces the model's "effective parameter count," allowing us to predict the additional loss incurred from training in low precision and from post-training quantization. For inference, we find that the degradation introduced by post-training quantization increases as models are trained on more data, eventually making additional pretraining data actively harmful. For training, our scaling laws allow us to predict the loss of a model with different parts in different precisions, and suggest that training larger models in lower precision may be compute optimal. We unify the scaling laws for post- and pre-training quantization to arrive at a single functional form that predicts degradation from training and inference in varied precisions. We fit our laws on over 465 pretraining runs and validate our predictions on model sizes up to 1.7B parameters trained on up to 26B tokens.
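To make the "effective parameter count" idea concrete, the following is a minimal illustrative sketch: a Chinchilla-style loss curve in which the parameter count is discounted as training precision drops. The decay form, the constant `gamma`, and all fitted constants here are hypothetical placeholders for exposition, not the values fitted in this work.

```python
import math

def effective_params(n_params: float, precision_bits: float,
                     gamma: float = 3.0) -> float:
    """Hypothetical discount: effective parameters shrink as precision
    (in bits) falls toward zero; gamma sets the sensitivity scale."""
    return n_params * (1.0 - math.exp(-precision_bits / gamma))

def predicted_loss(n_params: float, n_tokens: float, precision_bits: float,
                   A: float = 400.0, B: float = 410.0, E: float = 1.7,
                   alpha: float = 0.34, beta: float = 0.28) -> float:
    """Chinchilla-style loss A/N_eff^alpha + B/D^beta + E, with the raw
    parameter count N replaced by a precision-dependent N_eff.
    All constants are illustrative placeholders."""
    n_eff = effective_params(n_params, precision_bits)
    return A / n_eff**alpha + B / n_tokens**beta + E

# Lower training precision -> fewer effective parameters -> higher loss.
loss_4bit = predicted_loss(1e9, 26e9, precision_bits=4)
loss_16bit = predicted_loss(1e9, 26e9, precision_bits=16)
assert loss_4bit > loss_16bit
```

Under this toy form, a model trained at very low precision behaves like a smaller model trained at full precision, which is what lets a single functional form absorb precision into an existing scaling law.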