Large language models (LLMs) are growing rapidly in size, and parameter count has become a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even recently released models that are publicly available for commercial use, such as Falcon and Llama2, have billions of parameters. This growth makes deployment and operation very costly. Remarkable progress in quantization for large neural networks in general, and LLMs in particular, has made these models more accessible by allowing them to run on consumer-grade GPUs. Quantized models generally perform comparably to their unquantized base counterparts. Nonetheless, there remains a notable gap in our understanding of how quantized models respond to hyperparameters such as temperature, max new tokens, and top-k, particularly for next-word prediction. The present analysis shows that nf4 and fp4 are equally proficient 4-bit quantization techniques, with similar inference speed, memory consumption, and quality of generated content. Nevertheless, the two methods behave differently at different temperature settings, in both smaller and larger models. In general, 4-bit quantized models of varying sizes are more sensitive to lower temperature settings than their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference, whereas unquantized fp16 models consistently yield the fastest inference across models of all sizes.
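For readers unfamiliar with the decoding hyperparameters named above, the following is a minimal sketch of how temperature and top-k typically shape a single next-word prediction step. It is an illustration of the general mechanism only, not the evaluation setup used in this analysis; the function name `sample_next_token` and the toy logits are hypothetical. Max new tokens does not appear in the sketch because it only caps how many times such a step is repeated.

```python
import math
import random


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Sample one token index from raw logits (hypothetical helper).

    Returns the sampled index and the probabilities of the kept candidates.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")

    # Rank candidate token indices by logit, highest first.
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)

    # Top-k filtering: keep only the k highest-scoring candidates.
    if top_k is not None:
        indices = indices[:top_k]

    # Temperature scaling: values below 1 sharpen the distribution,
    # values above 1 flatten it.
    scaled = [logits[i] / temperature for i in indices]

    # Numerically stable softmax over the kept candidates.
    peak = max(scaled)
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    probs = [w / total for w in weights]

    choice = rng.choices(indices, weights=probs, k=1)[0]
    return choice, dict(zip(indices, probs))


# Toy logits for a three-word vocabulary (made-up values).
toy_logits = [2.0, 1.0, 0.1]
_, low_t = sample_next_token(toy_logits, temperature=0.1, top_k=2)
_, high_t = sample_next_token(toy_logits, temperature=10.0, top_k=2)
print(low_t)   # nearly all probability mass on index 0
print(high_t)  # mass spread almost evenly over indices 0 and 1
```

At low temperature the logits are divided by a small number, so small differences between them are amplified before the softmax. This gives one plausible intuition for why small perturbations of the logits, such as those introduced by 4-bit quantization, could matter more at lower temperature settings.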