Understanding the Impact of Post-Training Quantization on Large Language Models

Large language models (LLMs) are rapidly increasing in size, and parameter count has become a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even recently released models that are publicly available for commercial use, such as Falcon and Llama2, have billions of parameters. This growth in parameter count makes deployment and operation very costly. Remarkable progress in quantization for large neural networks in general, and LLMs in particular, has made these models more accessible by enabling deployment on consumer-grade GPUs. Quantized models generally perform comparably to their unquantized base counterparts. Nonetheless, there is a notable gap in our understanding of how quantized models respond to generation hyperparameters such as temperature, max new tokens, and top-k, particularly for next-word prediction. The present analysis reveals that nf4 and fp4 are equally proficient 4-bit quantization techniques, with similar inference speed, memory consumption, and quality of generated content. The study finds that nf4 is more resilient to temperature variations for the Llama2 series of models at lower temperatures, while fp4 and fp4-dq prove to be more suitable choices for the Falcon series of models. In general, 4-bit quantized models of varying sizes are more sensitive to temperature in the range of 0.5 to 0.8 than their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference, whereas unquantized bfloat16 models consistently yield the fastest inference across models of all sizes.
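To make the temperature and top-k hyperparameters concrete, the following minimal sketch shows how they typically reshape a next-token distribution before sampling. It is an illustration only, not the study's evaluation code: the function names `top_k_temperature_probs` and `sample_next_token` are hypothetical, and the logits are made-up values rather than outputs of any model.

```python
import math
import random


def top_k_temperature_probs(logits, temperature=1.0, top_k=None):
    """Return sampling probabilities after temperature scaling and top-k filtering.

    Hypothetical helper for illustration; `logits` is a plain list of floats.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")

    # Temperature < 1 sharpens the distribution, > 1 flattens it.
    scaled = [x / temperature for x in logits]

    if top_k is not None and top_k < len(scaled):
        # Keep the k largest logits and mask the rest.
        # Ties at the threshold are all kept, so more than k may survive.
        threshold = sorted(scaled, reverse=True)[top_k - 1]
        scaled = [x if x >= threshold else -math.inf for x in scaled]

    # Numerically stable softmax; masked entries get probability 0.
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token index from the filtered distribution."""
    probs = top_k_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # made-up values for three candidate tokens
    for t in (0.5, 0.8, 1.0):
        print(t, top_k_temperature_probs(toy_logits, temperature=t, top_k=2))
```

Because temperature divides the logits before the softmax, small perturbations of the logits (such as those introduced by low-bit quantization) are amplified at lower temperatures, which is one plausible reason why temperature sensitivity is worth measuring separately for quantized and unquantized models.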

