Quantization is the key method for reducing the inference latency, power consumption, and memory footprint of generative AI models. However, accuracy often degrades sharply when activations are quantized below eight bits. Recent work suggests that invertible linear transformations (e.g., rotations) can aid quantization by reparameterizing feature channels and weights. In this paper, we propose \textit{Sequence Transformation and Mixed Precision} (STaMP) quantization, a novel strategy that applies linear transformations along the \textit{sequence} dimension to exploit the strong local correlation in language and visual data. By keeping a small number of tokens in each intermediate activation at higher precision, we can maintain model accuracy at lower (average) activation bit-widths. We evaluate STaMP on recent LVM and LLM architectures, demonstrating that it significantly improves low bit-width activation quantization and complements established activation and weight quantization methods, including recent feature transformations.
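To make the sequence-dimension idea concrete, the following is a minimal illustrative sketch rather than the paper's exact formulation: the notation ($X$, $S$, $W$, $Q_b$, $\mathcal{H}$) is introduced here for exposition only, and the placement of the transform and the rule for selecting higher-precision tokens are assumptions. For an intermediate activation $X \in \mathbb{R}^{n \times d}$ with $n$ tokens, a weight matrix $W$, an invertible transform $S \in \mathbb{R}^{n \times n}$ acting on the token axis, and a $b$-bit quantizer $Q_b(\cdot)$, one can exploit $S^{-1}(SX)W = XW$ and quantize the transformed tokens at mixed precision:
\begin{equation*}
XW \;=\; S^{-1}(SX)\,W \;\approx\; S^{-1}\widetilde{Z}\,W,
\qquad
\widetilde{Z}_{i,:} =
\begin{cases}
Q_{b_{\mathrm{high}}}\!\big((SX)_{i,:}\big) & i \in \mathcal{H},\\
Q_{b_{\mathrm{low}}}\!\big((SX)_{i,:}\big) & \text{otherwise},
\end{cases}
\end{equation*}
where $\mathcal{H}$ is a small set of token indices kept at higher precision, so the average activation bit-width remains close to $b_{\mathrm{low}}$ while strongly correlated tokens are decorrelated along the sequence dimension before quantization.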