Large language models (LLMs) are rapidly increasing in size, and parameter count has become a key factor in the success of many commercial models, such as ChatGPT, Claude, and Bard. Even recently released models that are publicly available for commercial use, such as Falcon and Llama2, have billions of parameters. This growth makes deployment and operation very costly. Remarkable progress in quantization for large neural networks in general, and LLMs in particular, has made these models more accessible by enabling deployment on consumer-grade GPUs. Quantized models generally perform comparably to their unquantized base counterparts. Nonetheless, there is a notable gap in our understanding of how quantized models respond to generation hyperparameters such as temperature, max new tokens, and top-k, particularly for next-word prediction. The present analysis finds that nf4 and fp4 are equally proficient 4-bit quantization techniques, with similar inference speed, memory consumption, and quality of generated content. The study identifies nf4 as more resilient to temperature variations at lower temperatures for the Llama2 series of models, while fp4 and fp4-dq prove to be a more suitable choice for the Falcon series. In general, 4-bit quantized models of varying sizes are more sensitive to temperature in the range of 0.5 to 0.8, unlike their unquantized counterparts. Additionally, int8 quantization is associated with significantly slower inference, whereas unquantized bfloat16 models consistently yield the fastest inference across models of all sizes.
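To make the role of the temperature and top-k hyperparameters concrete, the following is a minimal sketch of how they are commonly applied to next-token logits before sampling. It is an illustration under stated assumptions, not the evaluation code used in this study: the function names `top_k_temperature_probs` and `sample_next_token` and the example logits are hypothetical, and real inference libraries may differ in detail.

```python
import math
import random


def top_k_temperature_probs(logits, temperature=1.0, top_k=None):
    """Return next-token probabilities after top-k filtering and temperature scaling.

    Hypothetical helper for illustration; not taken from the study.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if top_k is not None and top_k < 1:
        raise ValueError("top_k must be at least 1")
    # Rank token indices by logit and keep the k largest (all if top_k is None).
    indices = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    if top_k is not None:
        indices = indices[:top_k]
    # Dividing by a temperature below 1 sharpens the distribution;
    # a temperature above 1 flattens it.
    scaled = {i: logits[i] / temperature for i in indices}
    peak = max(scaled.values())  # subtract the max for numerical stability
    exps = {i: math.exp(v - peak) for i, v in scaled.items()}
    total = sum(exps.values())
    probs = [0.0] * len(logits)  # filtered-out tokens keep probability 0
    for i, e in exps.items():
        probs[i] = e / total
    return probs


def sample_next_token(logits, temperature=1.0, top_k=None, rng=random):
    """Draw one token index from the filtered, temperature-scaled distribution."""
    probs = top_k_temperature_probs(logits, temperature, top_k)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]
```

In this sketch, moving the temperature from 0.5 to 0.8 shifts probability mass away from the highest-scoring token. That is one plausible reason small logit perturbations introduced by 4-bit quantization could matter more in that range, although the sketch itself does not demonstrate this.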