Quantization is a proven, effective method for compressing large language models (LLMs). Although popular techniques such as W8A8 and W4A16 effectively preserve model performance, they often fail to accelerate both the prefill and decoding stages of inference at the same time. W4A8 is a promising strategy to accelerate both stages, but it usually leads to significant performance degradation. To address these issues, we present QQQ, a Quality Quattuor-bit Quantization method with 4-bit weights and 8-bit activations. QQQ employs adaptive smoothing and Hessian-based compensation, significantly enhancing the performance of quantized models without extensive training. Furthermore, we meticulously engineer W4A8 GEMM kernels to increase inference speed. Our specialized per-channel W4A8 GEMM and per-group W4A8 GEMM achieve impressive speedups of 3.67$\times$ and 3.29$\times$ over FP16 GEMM. Our extensive experiments show that QQQ matches the performance of existing state-of-the-art LLM quantization methods while significantly accelerating inference, achieving speedups of up to 2.24$\times$, 2.10$\times$, and 1.25$\times$ over FP16, W8A8, and W4A16, respectively.
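To make the W4A8 scheme concrete, the following is a minimal sketch of symmetric per-channel 4-bit weight quantization combined with per-tensor 8-bit activation quantization, with the integer GEMM emulated in int32 and rescaled afterwards. This illustrates the numeric format only; the function names are illustrative, and the sketch omits QQQ's adaptive smoothing, Hessian-based compensation, and fused CUDA kernels.

```python
import torch

def quantize_weights_w4(w: torch.Tensor):
    """Per-channel symmetric 4-bit quantization of a weight matrix
    of shape (out_features, in_features)."""
    # One scale per output channel, mapping the channel max to 7 (int4 range [-8, 7]).
    scale = (w.abs().amax(dim=1, keepdim=True) / 7.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(w / scale), -8, 7).to(torch.int8)
    return q, scale

def quantize_activations_a8(x: torch.Tensor):
    """Per-tensor symmetric 8-bit quantization of activations (int8 range [-128, 127])."""
    scale = (x.abs().amax() / 127.0).clamp_min(1e-8)
    q = torch.clamp(torch.round(x / scale), -128, 127).to(torch.int8)
    return q, scale

def w4a8_linear(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Simulated W4A8 linear layer: integer accumulation in int32,
    then a single floating-point rescale by the two quantization scales."""
    qw, sw = quantize_weights_w4(w)
    qx, sx = quantize_activations_a8(x)
    acc = qx.to(torch.int32) @ qw.t().to(torch.int32)
    return acc.to(torch.float32) * sx * sw.t()

x = torch.randn(4, 64)     # activations: (batch, in_features)
w = torch.randn(128, 64)   # weights: (out_features, in_features)
print(w4a8_linear(x, w).shape)  # torch.Size([4, 128])
```

In an actual deployment the int4 weights would be packed and the GEMM fused on hardware; the emulation above only shows why both stages can benefit: activations are quantized to 8 bits for fast integer compute (helping compute-bound prefill), while weights are stored in 4 bits (halving the memory traffic that bounds decoding).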