Modern recurrent layers are emerging as a promising path toward edge deployment of foundation models, especially in the context of large language models (LLMs). By compressing the whole input sequence into a finite-dimensional representation, recurrent layers can model long-range dependencies while maintaining a constant per-token inference cost and a fixed memory footprint. However, practical deployment of LLMs in resource-limited environments often requires further model compression, such as quantization and pruning. While these techniques are well established for attention-based models, their effects on recurrent layers remain underexplored. In this preliminary work, we focus on post-training quantization of recurrent LLMs and show that Mamba models exhibit the same pattern of outlier channels observed in attention-based LLMs, making activation outliers the main source of difficulty in quantizing SSMs. We report baseline results for post-training quantization of Mamba that does not account for these outliers, and suggest first steps toward outlier-aware quantization.
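As a minimal illustration of the quantization difficulty the abstract describes, the sketch below (plain Python, with hypothetical activation values) shows how a single outlier channel inflates the per-tensor quantization scale in symmetric int8 quantization, leaving almost no resolution for the other channels. The specific magnitudes are illustrative assumptions, not values taken from the paper's experiments.

```python
def quantize_int8(values, scale):
    """Symmetric int8 quantization: round(x / scale), clipped to [-127, 127],
    then dequantized back to floats for error inspection."""
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return [v * scale for v in q]

# Typical activations in [-1, 1], plus one hypothetical outlier channel at 60.0.
acts = [0.3, -0.7, 0.5, -0.2, 60.0]

# Per-tensor scale is set by the absolute maximum, so it is dominated
# by the outlier: scale = 60 / 127, about 0.47 per quantization step.
scale = max(abs(v) for v in acts) / 127.0

deq = quantize_int8(acts, scale)
errors = [abs(a - d) for a, d in zip(acts, deq)]

# The outlier is represented almost exactly, while the small-magnitude
# channels suffer errors on the order of the (outlier-inflated) step size.
print(scale)
print(errors)
```

Per-channel or outlier-aware schemes avoid this by not letting a few extreme channels dictate the scale for the whole tensor, which is the direction the abstract's "first steps" point toward.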