Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, efficiently serving LLMs has been a challenge due to the large memory bottleneck, specifically in small-batch inference settings (e.g., mobile devices). Weight-only quantization is a promising approach, but sub-4-bit quantization remains a challenge due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than the conventional per-output-channel (per-OC) grouping. Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so grouping the weights along the IC direction can likewise isolate outliers within a group. We also find that activation outliers are not the sole determinant of quantization difficulty; inherent weight sensitivities exist as well. With per-IC quantization as a new outlier-friendly scheme, we propose Adaptive Dimensions (AdaDim), a versatile quantization framework that adapts to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs. Code is available at https://github.com/johnheo/adadim-llm
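To make the two grouping directions concrete, the sketch below contrasts per-OC and per-IC group quantization on a 2-D weight matrix using plain Round-To-Nearest with absmax scales. It is a minimal illustration under assumed settings (group size, bit-width, and the helper names rtn_groups, per_oc_quant, per_ic_quant are all hypothetical), not the released AdaDim implementation.

```python
# Minimal sketch, assuming group_size divides both weight dimensions; the helpers
# and simple absmax Round-To-Nearest here are illustrative, not the authors' code.
import torch

def rtn_groups(w2d: torch.Tensor, n_bits: int = 3, group_size: int = 128) -> torch.Tensor:
    """Fake-quantize a 2-D weight with one absmax scale per contiguous group."""
    groups = w2d.reshape(-1, group_size)                  # consecutive elements form a group
    scale = groups.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)
    qmax = 2 ** (n_bits - 1) - 1
    q = torch.clamp(torch.round(groups / scale * qmax), -qmax - 1, qmax)
    return (q * scale / qmax).reshape(w2d.shape)          # dequantized (fake-quant) weights

def per_oc_quant(w: torch.Tensor, **kw) -> torch.Tensor:
    # Conventional per-OC grouping: W is [out_ch, in_ch], so each group runs across
    # input channels within one output row; an outlier input channel inflates the
    # scales of many groups.
    return rtn_groups(w.contiguous(), **kw)

def per_ic_quant(w: torch.Tensor, **kw) -> torch.Tensor:
    # Per-IC grouping: transpose to [in_ch, out_ch] so each group stays inside a
    # single input channel, confining the large-magnitude column to its own groups.
    return rtn_groups(w.t().contiguous(), **kw).t().contiguous()

# Toy usage: inject an outlier input channel and compare reconstruction error.
w = torch.randn(512, 512)
w[:, 123] *= 20                                           # emulated outlier column
err_oc = (w - per_oc_quant(w)).pow(2).mean().item()
err_ic = (w - per_ic_quant(w)).pow(2).mean().item()
print(f"per-OC MSE: {err_oc:.5f}  per-IC MSE: {err_ic:.5f}")
```

In this toy setup, per-IC grouping keeps the outlier column's large scale from contaminating neighboring input channels; per the abstract, AdaDim goes further by adapting the grouping dimension to each layer's weight sensitivity pattern rather than fixing one direction.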