Quantization methods reduce the number of bits required to represent each parameter in a model, trading accuracy for a smaller memory footprint and lower inference latency. However, the final model size depends on both the number of parameters in the original model and the rate of compression. For example, a 30B 8-bit model and a 60B 4-bit model have the same number of bits but may have very different zero-shot accuracies. In this work, we study this trade-off by developing inference scaling laws of zero-shot performance in Large Language Models (LLMs) to determine the bit precision and model size that maximize zero-shot performance. We run more than 35,000 experiments with 16-bit inputs and k-bit parameters to examine which zero-shot quantization methods improve scaling for 3- to 8-bit precision at scales of 19M to 176B parameters across the LLM families BLOOM, OPT, NeoX/Pythia, and GPT-2. We find that it is challenging to improve the bit-level scaling trade-off; the only improvements come from using a small block size (splitting the parameters into small, independently quantized blocks) and from the choice of quantization data type (e.g., Int vs. Float). Overall, our findings show that 4-bit precision is almost universally optimal for the trade-off between total model bits and zero-shot accuracy.
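To make the bit-budget comparison and the role of block size concrete, the following is a minimal sketch rather than the procedure used in the experiments. It assumes symmetric absmax integer quantization; the function names (`total_model_bits`, `quantize_blockwise`, `dequantize_blockwise`, `mean_abs_error`) and the toy weight vector are hypothetical and chosen only for illustration.

```python
def total_model_bits(num_params: int, bits_per_param: int) -> int:
    """Total bits needed to store the parameters (ignores any per-block metadata)."""
    return num_params * bits_per_param


def quantize_blockwise(weights, bits=4, block_size=64):
    """Quantize each block independently with symmetric absmax integer quantization.

    Assumption: one floating-point scale per block; this is an illustrative
    scheme, not necessarily the one studied in the paper.
    """
    qmax = 2 ** (bits - 1) - 1  # e.g. 7 for 4-bit signed integers
    codes, scales = [], []
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        absmax = max(abs(w) for w in block)
        scale = absmax / qmax if absmax > 0 else 1.0
        scales.append(scale)
        codes.extend(round(w / scale) for w in block)
    return codes, scales


def dequantize_blockwise(codes, scales, block_size=64):
    """Map integer codes back to floats using the scale of their block."""
    return [c * scales[i // block_size] for i, c in enumerate(codes)]


def mean_abs_error(a, b):
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# A 30B 8-bit model and a 60B 4-bit model occupy the same number of bits.
assert total_model_bits(30_000_000_000, 8) == total_model_bits(60_000_000_000, 4)

# Toy weights: small values plus a single outlier in the first position.
weights = [0.01 * ((i % 7) - 3) for i in range(256)]
weights[0] = 5.0

for block_size in (256, 16):
    codes, scales = quantize_blockwise(weights, bits=4, block_size=block_size)
    restored = dequantize_blockwise(codes, scales, block_size=block_size)
    print(block_size, mean_abs_error(weights, restored))
```

In this toy setup, a single block of 256 lets the outlier set the scale for every weight, so the small values all round to zero; with blocks of 16, only the outlier's own block is affected. The sketch does not count the storage cost of the per-block scales, which grows as the block size shrinks.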