Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods cannot efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage overhead. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, compressing full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) solver to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization and enables sub-1-bit compression, making deployment of large models feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, enabling a 70B model to run on a consumer 8 GB GPU. Code is available at https://github.com/SamsungLabs/NanoQuant.
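To make the storage arithmetic behind sub-1-bit compression concrete, the following is a minimal sketch, not NanoQuant's actual storage format. It assumes an $m \times n$ weight matrix is replaced by two binary factors of rank $r$ (shapes $m \times r$ and $n \times r$) plus one 16-bit scale per row and per column; the scale layout, the 16-bit baseline, and the round 70e9 parameter count are all illustrative assumptions, and the function names are hypothetical.

```python
def bits_per_weight(m: int, n: int, r: int, scale_bits: int = 16) -> float:
    """Average storage per original weight under an assumed rank-r binary factorization."""
    binary_bits = r * (m + n)               # two binary factors, 1 bit per entry
    scale_overhead = scale_bits * (m + n)   # assumed: one scale per row and per column
    return (binary_bits + scale_overhead) / (m * n)


def compressed_size_gb(num_params: float, baseline_bits: int, ratio: float) -> float:
    """Model size in GB after dividing the baseline footprint by a compression ratio."""
    baseline_bytes = num_params * baseline_bits / 8
    return baseline_bytes / ratio / 1e9


# A 4096 x 4096 layer at rank 1024 already falls below 1 bit per weight.
layer_bits = bits_per_weight(4096, 4096, 1024)

# Rough check of the abstract's example: 25.8x on an assumed 16-bit, 70e9-parameter model.
effective_bits = 16 / 25.8
size_gb = compressed_size_gb(70e9, 16, 25.8)

print(f"bits per weight (rank 1024): {layer_bits:.4f}")
print(f"effective bits at 25.8x:     {effective_bits:.2f}")
print(f"approx. compressed size:     {size_gb:.1f} GB")
```

Under these assumptions the cost per weight scales as $r(m+n)/(mn)$, so any rank below $mn/(m+n)$ yields fewer than 1 bit per weight before scale overhead. The size estimate is only a rough consistency check against the 8 GB figure, since the reported ratio may account for layers that are not quantized.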