Understanding the Impact of Post-Training Quantization on Large Language Models

Large language models (LLMs) are rapidly increasing in size, with the number of parameters becoming a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even the recently released publicly accessible models for commercial usage, such as Falcon and Llama2, come equipped with billions of parameters. This significant increase in the number of parameters makes deployment and operation very costly. The remarkable progress in the field of quantization for large neural networks in general and LLMs in particular, has made these models more accessible by enabling them to be deployed on consumer-grade GPUs. Quantized models generally demonstrate comparable performance levels to their unquantized base counterparts. Nonetheless, there exists a notable gap in our comprehensive understanding of how these quantized models respond to hyperparameters, such as temperature, max new tokens, and topk, particularly for next word prediction. The present analysis reveals that nf4 and fp4 are equally proficient 4-bit quantization techniques, characterized by similar attributes such as inference speed, memory consumption, and the quality of generated content. Nevertheless, these quantization methods exhibit distinct behaviors at varying temperature settings, both in the context of smaller and larger models. It is noteworthy that, in general, 4-bit quantized models of varying sizes exhibit heightened sensitivity to lower temperature settings, unlike their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference speeds, whereas unquantized fp16 models consistently yield the fastest inference speeds across models of all sizes.

翻译：大规模语言模型（LLM）的规模正迅速增长，参数数量成为许多商业模型（如ChatGPT、Claude和Bard）成功的关键因素。即便是近期发布的面向商业用途的公开可访问模型（如Falcon和Llama2），也配备了数十亿参数。这种参数数量的显著增加使得部署和运行成本极为高昂。针对大型神经网络（尤其是LLM）的量化领域取得的显著进展，通过将这些模型部署到消费级GPU上，使其更易获取。量化模型通常表现出与其未量化基础模型相当的性能水平。然而，我们对于这些量化模型如何响应超参数（如温度、最大新词数量和topk，特别是针对下一个词预测任务）仍缺乏全面理解。本文分析揭示，nf4和fp4是同等高效的4位量化技术，在推理速度、内存消耗和生成内容质量等方面具有相似特性。然而，这些量化方法在不同温度设置下表现出截然不同的行为，这一现象在小规模和大规模模型中均存在。值得注意的是，总体而言，不同规模的4位量化模型对较低温度设置表现出更高的敏感性，这与未量化模型形成对比。此外，int8量化与显著较慢的推理速度相关，而未量化的fp16模型在所有规模的模型中始终能实现最快的推理速度。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【NeurIPS2021】用于文本图表示学习的 GNN 嵌套 Transformer 模型：GraphFormers

专知会员服务

46+阅读 · 2021年11月24日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日