Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations below 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20\%-50\% compared to current on-device quantization strategies, 3) requiring a limited compute budget, and 4) being compatible with mobile-friendly compute units, e.g. NPUs.
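To make the idea of a weight equivalent transformation concrete, the following is a minimal NumPy sketch of the general technique the abstract builds on (in the style of SmoothQuant/AWQ-like channel scaling, not MobileQuant's actual end-to-end optimization): a per-channel scale `s` divides the activations and multiplies the corresponding weight rows, so the matmul output is mathematically unchanged, while the rebalanced dynamic ranges reduce 8-bit quantization error. All names (`quantize_sym`, `quant_matmul`, the grid over `alpha`) are illustrative assumptions, not the paper's API.

```python
import numpy as np

def quantize_sym(t, bits=8):
    # Simulated symmetric per-tensor quantization: round to the
    # integer grid, then dequantize back to float for comparison.
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(t).max() / qmax
    return np.clip(np.round(t / scale), -qmax - 1, qmax) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 16))   # activations (batch, channels)
x[:, 0] *= 30.0                 # one outlier channel, common in LLM activations
w = rng.normal(size=(16, 16)) * 0.1
ref = x @ w                     # full-precision reference output

def quant_matmul(x, w, s):
    # Equivalent transform: (x / s) @ (s * w) == x @ w in full precision,
    # but quantizing the transformed tensors balances their ranges.
    xq = quantize_sym(x / s)
    wq = quantize_sym(w * s[:, None])
    return xq @ wq

# Baseline: quantize with no transform (s = 1).
err_plain = np.abs(quant_matmul(x, w, np.ones(16)) - ref).mean()

# Pick the per-channel scale by a simple grid search over alpha,
# trading off activation vs. weight quantization difficulty.
best = err_plain
for alpha in np.linspace(0.1, 0.9, 9):
    s = np.abs(x).max(axis=0) ** alpha
    s = s / s.mean()            # normalize to keep scales well-conditioned
    best = min(best, np.abs(quant_matmul(x, w, s) - ref).mean())

assert best < err_plain  # rebalancing shrinks the quantization error
```

MobileQuant goes further than this sketch by learning the transformation and the activation range parameters jointly with gradients, end-to-end, rather than searching a handcrafted scale heuristic.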