Large Language Models (LLMs) are proficient in natural language processing tasks, but their deployment is often restricted by extensive parameter sizes and computational demands. This paper focuses on post-training quantization (PTQ) in LLMs, specifically 4-bit weight and 8-bit activation (W4A8) quantization, to enhance computational efficiency -- a topic less explored than weight-only quantization. We present two innovative techniques, activation-quantization-aware scaling (AQAS) and sequence-length-aware calibration (SLAC), which improve PTQ by considering the combined effects of quantization on weights and activations and by aligning calibration sequence lengths with target tasks. Moreover, we introduce dINT, a hybrid data format combining integer and denormal representations, to address the underflow issue in W4A8 quantization, where small values are rounded to zero. Through rigorous evaluations of LLMs, including OPT and LLaMA, we demonstrate that our techniques significantly boost task accuracies to levels comparable with full-precision models. By developing arithmetic units compatible with dINT, we further confirm that our methods yield a 2$\times$ hardware efficiency improvement over an 8-bit integer MAC unit.
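To make the underflow problem concrete, the following is a minimal illustrative sketch, not the paper's actual dINT encoding: under plain signed 4-bit integer quantization, any value smaller than half the quantization step rounds to zero, whereas a dINT-style format reserves a finer "denormal" sub-step for that region so small magnitudes survive. The function names and the `denormal_frac` parameter are hypothetical choices for this sketch.

```python
import numpy as np

def quantize_int4(x, scale):
    # Plain signed 4-bit integer quantization: codes in [-8, 7].
    # Values with |x| < scale / 2 round to code 0 and are lost (underflow).
    q = np.clip(np.round(x / scale), -8, 7)
    return q * scale

def quantize_dint4(x, scale, denormal_frac=0.5):
    # Hypothetical dINT-style sketch: when the plain INT4 code would be 0,
    # fall back to a finer denormal sub-step (scale * denormal_frac),
    # so small-but-nonzero values keep a nonzero representation.
    q = np.round(x / scale)
    denorm = np.round(x / (scale * denormal_frac)) * denormal_frac
    out = np.where(q == 0,
                   np.clip(denorm, -denormal_frac, denormal_frac),
                   q)
    return np.clip(out, -8, 7) * scale

# A small activation that plain INT4 erases but the hybrid format preserves:
x, scale = 0.3, 1.0
plain = quantize_int4(x, scale)    # rounds to 0.0 -> information lost
hybrid = quantize_dint4(x, scale)  # maps to the denormal level 0.5
```

Large values take the ordinary integer path in both functions; only the near-zero region behaves differently, which is why such a format can recover accuracy without widening the bit budget.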