Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, serving LLMs efficiently remains challenging because of the memory bottleneck, particularly in small-batch inference settings (e.g., on mobile devices). Weight-only quantization is a promising approach, but sub-4-bit quantization remains difficult due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than within each output channel (per-OC), as is conventional. Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so grouping the weights along the same IC direction can isolate outliers within a group. We also find that activation outliers alone do not dictate quantization difficulty: inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs. Code is available at https://github.com/johnheo/adadim-llm
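To make the difference between per-OC and per-IC grouping concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes the weight matrix is stored as (out_channels, in_channels), uses plain asymmetric round-to-nearest within each group, and introduces hypothetical helper names (`rtn_dequant`, `quantize_weight`, `output_error`, `pick_dim`). The selection rule in `pick_dim`, which keeps whichever grouping dimension gives the lower output error on calibration inputs, is only an illustrative assumption about how a framework might adapt the dimension per layer; the abstract does not specify AdaDim's actual criterion.

```python
import numpy as np


def rtn_dequant(groups: np.ndarray, bits: int) -> np.ndarray:
    """Asymmetric round-to-nearest over the last axis; returns dequantized values."""
    qmax = 2 ** bits - 1
    w_min = groups.min(axis=-1, keepdims=True)
    w_max = groups.max(axis=-1, keepdims=True)
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0, 1.0, scale)  # guard against division by zero
    zero = np.round(-w_min / scale)
    q = np.clip(np.round(groups / scale) + zero, 0, qmax)
    return (q - zero) * scale


def quantize_weight(W: np.ndarray, bits: int = 3, group_size: int = 4,
                    dim: str = "ic") -> np.ndarray:
    """Group-wise quantization of W, assumed to have shape (out_ch, in_ch).

    dim="oc": each group lies within one output channel (one row of W).
    dim="ic": each group lies within one input channel (one column of W).
    """
    out_ch, in_ch = W.shape
    if dim == "oc":
        assert in_ch % group_size == 0
        g = W.reshape(out_ch, in_ch // group_size, group_size)
        return rtn_dequant(g, bits).reshape(out_ch, in_ch)
    if dim == "ic":
        assert out_ch % group_size == 0
        g = W.T.reshape(in_ch, out_ch // group_size, group_size)
        return rtn_dequant(g, bits).reshape(in_ch, out_ch).T
    raise ValueError(f"unknown dim: {dim!r}")


def output_error(W: np.ndarray, W_q: np.ndarray, X: np.ndarray) -> float:
    """Mean squared error of the layer output X @ W.T on calibration inputs X."""
    return float(np.mean((X @ W.T - X @ W_q.T) ** 2))


def pick_dim(W: np.ndarray, X: np.ndarray, bits: int = 3, group_size: int = 4):
    """Hypothetical selection rule (an assumption): keep the lower-error dimension."""
    errs = {d: output_error(W, quantize_weight(W, bits, group_size, d), X)
            for d in ("oc", "ic")}
    return min(errs, key=errs.get), errs


# Synthetic example: one input channel carries a large activation outlier.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))      # (out_channels, in_channels)
X = rng.normal(size=(32, 16))     # calibration activations
X[:, 3] *= 50.0                   # outlier in input channel 3
best, errs = pick_dim(W, X)
print(best, errs)
```

Because each per-IC group stays inside a single column of W, the quantization parameters of one input channel are unaffected by the weights of any other input channel. That is the sense in which the sketch isolates a channel within its own groups. Which dimension gives the lower error depends on the weights and activations at hand.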