Post-training quantization (PTQ) has emerged as a promising technique for reducing the cost of large language models (LLMs): it lowers both memory consumption and computational overhead. Because different deployment scenarios demand both high efficiency and high task performance, a comprehensive evaluation of quantized LLMs is needed to guide the choice of quantization method. This paper provides such an evaluation, measuring the effect of PTQ applied to weights, activations, and the KV cache on 11 model families (OPT, LLaMA2, Falcon, Bloomz, Mistral, ChatGLM, Vicuna, LongChat, StableLM, Gemma, and Mamba), with parameter counts ranging from 125M to 180B. The evaluation covers five types of tasks: basic NLP, emergent ability, trustworthiness, dialogue, and long-context tasks. We also evaluate state-of-the-art (SOTA) quantization methods to demonstrate their applicability. Based on these extensive experiments, we systematically summarize the effects of quantization, give recommendations for applying quantization techniques, and point out future directions.