Agile-Quant: Activation-Guided Quantization for Faster Inference of LLMs on the Edge

Large Language Models (LLMs) stand out for their impressive performance in intricate language modeling tasks. However, their demanding computational and memory requirements are an obstacle to broad use on edge devices. Quantization has therefore been introduced to boost the on-device efficiency of LLMs. Recent works show that 8-bit or lower weight quantization is feasible with minimal impact on end-to-end task performance, but activations remain unquantized. Meanwhile, mainstream commodity edge devices still struggle to execute these sub-8-bit quantized networks effectively. In this paper, we propose Agile-Quant, an activation-guided quantization framework for popular LLMs, and implement an end-to-end accelerator on multiple edge devices for faster inference. Guided by hardware profiling and activation analysis, we first introduce a basic activation quantization strategy that balances the trade-off between task performance and real inference speed. We then leverage an activation-aware token pruning technique to reduce outliers and their adverse impact on attentivity. Finally, we use a SIMD-based 4-bit multiplier and our efficient TRIP matrix multiplication to implement the accelerator for LLMs on the edge. We apply our framework to LLMs of different scales, including LLaMA, OPT, and BLOOM, with 4-bit or 8-bit activation quantization and 4-bit weight quantization. Experiments show that Agile-Quant quantizes model weights and activations simultaneously while maintaining task performance comparable to existing weight-only quantization methods. Moreover, in the 8- and 4-bit scenario, Agile-Quant achieves an on-device speedup of up to 2.55x compared to its FP16 counterparts across multiple edge devices, marking a pioneering advancement in this domain.

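
To make the idea of joint weight and activation quantization concrete, the following is a minimal sketch of a dot product with 8-bit activations and 4-bit weights. It assumes plain symmetric per-tensor quantization, which the abstract does not specify, and it does not model token pruning, the SIMD-based 4-bit multiplier, or TRIP matrix multiplication. The function names quantize and quantized_dot are hypothetical and chosen only for illustration.

```python
def quantize(values, bits):
    """Symmetric per-tensor quantization of floats to signed integers.

    Assumption: one scale per tensor; this is an illustrative scheme,
    not necessarily the one used by Agile-Quant.
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to [-qmax, qmax].
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def quantized_dot(activations, weights, act_bits=8, weight_bits=4):
    """Dot product with integer accumulation, rescaled to float at the end."""
    qa, sa = quantize(activations, act_bits)
    qw, sw = quantize(weights, weight_bits)
    acc = sum(a * w for a, w in zip(qa, qw))  # integer multiply-accumulate
    return acc * sa * sw                      # undo both scales once


activations = [0.5, -1.0, 0.25, 2.0]
weights = [0.1, -0.2, 0.3, 0.05]
exact = sum(a * w for a, w in zip(activations, weights))
approx = quantized_dot(activations, weights)
print(f"exact={exact:.4f} approx={approx:.4f}")
```

The multiply-accumulate loop runs entirely on small integers, and the two floating-point scales are applied only once at the end. Low-bit integer kernels on edge hardware exploit this same property, although a real implementation would pack the 4-bit values and use vector instructions rather than a Python loop.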