Powerful large language models (LLMs) are increasingly expected to run at lower computational cost, bringing their capabilities to resource-constrained devices. Post-training quantization (PTQ) has emerged as a leading approach to this goal, with the best methods compressing weights to less than 2 bits on average. In this paper, we propose Channel-Relaxed Vector Quantization (CRVQ), a novel technique that significantly improves the performance of PTQ baselines at the cost of only a minimal number of additional bits. This state-of-the-art extreme compression method achieves its results through two key innovations: (1) carefully selecting and reordering a very small subset of critical weight channels, and (2) leveraging multiple codebooks to relax the quantization constraint on those critical channels. With our method, we demonstrate a 38.9% improvement over the current strongest sub-2-bit PTQ baseline, bringing 1-bit compression closer to lossless. Furthermore, our approach offers flexible trade-offs between quantization bit-width and performance, providing a wider range of deployment options for diverse hardware platforms.
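To make the two mechanisms named above concrete, the following is a minimal sketch, not the paper's implementation: it ranks input channels by a simple saliency proxy (column L2 norm, an assumption), reorders them so the critical ones are grouped, and quantizes the critical group with additional residual codebooks while the remaining channels use a single codebook. The vector dimension, channel count, codebook size, and k-means fitting are all illustrative choices.

```python
# Illustrative sketch of critical-channel reordering plus multi-codebook
# relaxation; hyperparameters and the saliency metric are assumptions,
# not the authors' settings.
import numpy as np

def kmeans_codebook(vectors, k, iters=20, seed=0):
    """Fit a k-entry codebook to rows of `vectors` with Lloyd's algorithm."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), k, replace=False)]
    for _ in range(iters):
        d = ((vectors[:, None, :] - centers[None]) ** 2).sum(-1)
        assign = d.argmin(1)
        for j in range(k):
            if (assign == j).any():
                centers[j] = vectors[assign == j].mean(0)
    return centers

def residual_vq(vectors, k, n_books, seed=0):
    """Quantize with `n_books` codebooks fit sequentially on residuals;
    more codebooks -> lower error (the 'relaxed' budget for critical channels)."""
    residual = vectors.copy()
    out = np.zeros_like(vectors)
    for b in range(n_books):
        cb = kmeans_codebook(residual, k, seed=seed + b)
        d = ((residual[:, None, :] - cb[None]) ** 2).sum(-1)
        picked = cb[d.argmin(1)]
        out += picked
        residual -= picked
    return out

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))          # toy weight matrix: (out, in) channels
vec_dim, n_critical, k = 8, 16, 64       # illustrative hyperparameters

# (1) rank input channels by a saliency proxy (column norm) and reorder
# so the most critical channels are grouped together.
order = np.argsort(-np.linalg.norm(W, axis=0))
W_perm = W[:, order]

# (2) quantize the critical group with extra codebooks, the rest with one.
crit, rest = W_perm[:, :n_critical], W_perm[:, n_critical:]
crit_q = residual_vq(crit.reshape(-1, vec_dim), k, n_books=2)
rest_q = residual_vq(rest.reshape(-1, vec_dim), k, n_books=1)

W_hat = np.empty_like(W_perm)
W_hat[:, :n_critical] = crit_q.reshape(crit.shape)
W_hat[:, n_critical:] = rest_q.reshape(rest.shape)
W_hat = W_hat[:, np.argsort(order)]      # undo the channel permutation
print("relative error:", np.linalg.norm(W - W_hat) / np.linalg.norm(W))
```

Because only `n_critical` of the channels receive the second codebook, the average bit-width rises only slightly while the reconstruction error on the most salient channels drops, which is the trade-off the abstract describes.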