Large Language Models (LLMs) have recently demonstrated remarkable success across various tasks. However, serving LLMs efficiently remains a challenge because of their large memory bottleneck, particularly in small-batch inference settings (e.g., mobile devices). Weight-only quantization is a promising approach, but sub-4-bit quantization remains difficult due to large-magnitude activation outliers. To mitigate the undesirable outlier effect, we first propose per-IC quantization, a simple yet effective method that creates quantization groups within each input channel (IC) rather than within each output channel (OC), as is conventional. Our method is motivated by the observation that activation outliers affect the input dimension of the weight matrix, so grouping the weights along the IC direction can likewise isolate outliers within a group. We also find that activation outliers alone do not dictate quantization difficulty; inherent weight sensitivities also exist. With per-IC quantization as a new outlier-friendly scheme, we then propose Adaptive Dimensions (AdaDim), a versatile quantization framework that can adapt to various weight sensitivity patterns. We demonstrate the effectiveness of AdaDim by augmenting prior methods such as Round-To-Nearest and GPTQ, showing significant improvements across various language modeling benchmarks for both base (up to +4.7% on MMLU) and instruction-tuned (up to +10% on HumanEval) LLMs.
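To make the grouping distinction concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a weight matrix laid out as (output channels, input channels), plain asymmetric min-max Round-To-Nearest, and an output mean-squared-error criterion for choosing the grouping dimension per layer; the names `rtn_quantize` and `select_dim`, the group size, and the toy data are all illustrative assumptions.

```python
import numpy as np


def rtn_quantize(w, n_bits=3, group_size=4, axis=1):
    """Asymmetric round-to-nearest with groups formed along `axis`.

    Assumed layout: w has shape (out_channels, in_channels).
    axis=1 -> groups run along the input dimension inside each output
              channel (conventional per-OC grouping).
    axis=0 -> groups run along the output dimension inside each input
              channel (per-IC grouping).
    Returns the dequantized weights with the same shape as `w`.
    """
    moved = np.moveaxis(w, axis, -1)           # grouping axis goes last
    shape = moved.shape
    groups = moved.reshape(-1, group_size)     # one row per quantization group
    g_min = groups.min(axis=1, keepdims=True)
    g_max = groups.max(axis=1, keepdims=True)
    levels = 2 ** n_bits - 1
    scale = np.maximum(g_max - g_min, 1e-8) / levels  # avoid division by zero
    q = np.clip(np.round((groups - g_min) / scale), 0, levels)
    deq = q * scale + g_min
    return np.moveaxis(deq.reshape(shape), -1, axis)


def select_dim(w, x, **kwargs):
    """Pick the grouping dimension with the lower output error.

    Assumption: the criterion is the MSE between x @ w.T and the output
    computed with quantized weights, evaluated on calibration inputs x.
    """
    ref = x @ w.T
    errors = {}
    for name, axis in (("per-OC", 1), ("per-IC", 0)):
        w_hat = rtn_quantize(w, axis=axis, **kwargs)
        errors[name] = float(np.mean((x @ w_hat.T - ref) ** 2))
    return min(errors, key=errors.get), errors


# Toy layer: input channel 3 carries much larger weights than the others.
rng = np.random.default_rng(0)
w_toy = np.tile(np.array([0.1, 0.2, 0.35, 50.0]), (4, 1))  # shape (4, 4)
x_toy = rng.standard_normal((8, 4))                        # calibration inputs
choice, errs = select_dim(w_toy, x_toy, n_bits=3, group_size=4)
print(choice, errs)
```

In this toy case every per-OC group contains the large-magnitude channel, which stretches the quantization range for the small weights sharing that group, whereas per-IC grouping confines that channel to its own groups. Real layers will not separate this cleanly, which is why a per-layer selection step is sketched rather than a fixed choice.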