Large Language Models (LLMs) have demonstrated remarkable success across a wide range of language tasks, but their deployment on edge devices remains challenging due to the substantial memory requirements imposed by their large parameter counts. Weight-only quantization presents a promising solution to reduce the memory footprint of LLMs. However, existing approaches primarily focus on integer-bit quantization, limiting their adaptability to fractional-bit quantization tasks and preventing full utilization of the available storage space on devices. In this paper, we introduce Channel-Wise Mixed-Precision Quantization (CMPQ), a novel mixed-precision quantization method that allocates quantization precision in a channel-wise pattern based on activation distributions. By assigning different precision levels to different weight channels, CMPQ can adapt to any bit-width constraint. CMPQ employs a non-uniform quantization strategy and incorporates two outlier extraction techniques that collaboratively preserve critical information, thereby minimizing the quantization loss. Experiments on LLMs of various sizes demonstrate that CMPQ not only enhances performance on integer-bit quantization tasks but also achieves significant performance gains with a modest increase in memory usage. CMPQ thus represents an adaptive and effective approach to LLM quantization, offering substantial benefits across diverse device capabilities.
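The core idea of channel-wise mixed-precision allocation under a fractional average bit budget can be illustrated with a minimal sketch. This is a hypothetical greedy allocator (all names and the greedy rule are our assumptions, not CMPQ's exact procedure): every channel starts at the minimum bit-width, and the remaining bit budget is spent on the channels whose activation magnitudes are largest, since quantization error there affects the output most.

```python
def allocate_channel_bits(act_norms, avg_bits, min_bits=2, max_bits=4):
    """Hypothetical sketch of channel-wise bit allocation (assumed
    interface, not the paper's algorithm): assign each weight channel a
    bit-width so that the mean matches the fractional target avg_bits,
    giving higher precision to channels with larger activation norms."""
    n = len(act_norms)
    bits = [min_bits] * n
    budget = round(avg_bits * n) - min_bits * n  # extra bits to hand out
    # Most salient channels first (descending activation magnitude).
    for idx in sorted(range(n), key=lambda i: -act_norms[i]):
        grant = min(max_bits - bits[idx], budget)
        bits[idx] += grant
        budget -= grant
        if budget == 0:
            break
    return bits

# e.g. a 2.5-bit average over 4 channels: the most salient channel
# (activation norm 5.0) is promoted to 4 bits, the rest stay at 2.
bits = allocate_channel_bits([0.1, 5.0, 0.3, 2.0], avg_bits=2.5)
```

Because precision is assigned per channel rather than per tensor, any fractional average bit-width between `min_bits` and `max_bits` is reachable, which is what lets such a scheme adapt to arbitrary device storage constraints.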