Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to compress models efficiently to binary (1-bit) levels: they either require large amounts of data and compute or incur additional storage overhead. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem, compressing full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) procedure to precisely initialize the latent binary matrices and scales, and then tunes the initialized parameters through block-wise and model-wise reconstruction. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama2-70B by 25.8$\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.
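To make the low-rank binary factorization idea concrete, the sketch below shows (a) the classic closed-form 1-bit quantization of a weight matrix with per-row scales, and (b) a hypothetical alternating sign-update heuristic toward $W \approx \alpha\, U V$ with binary factors $U \in \{-1,+1\}^{m \times r}$ and $V \in \{-1,+1\}^{r \times n}$. This is an illustrative sketch under assumed shapes and update rules, not the paper's ADMM initialization or its block/model reconstruction.

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_rowwise(W):
    """Plain 1-bit quantization: per row i, minimize ||w_i - a_i * b_i||
    over b_i in {-1,+1}^n and scale a_i. The well-known closed form is
    b_i = sign(w_i), a_i = mean(|w_i|)."""
    B = np.where(W >= 0, 1.0, -1.0)
    alpha = np.abs(W).mean(axis=1, keepdims=True)
    return alpha, B

def lowrank_binary_sketch(W, r=16, iters=20):
    """Hypothetical alternating heuristic for W ~= alpha * (U @ V) with
    binary U (m x r) and V (r x n) and a single scale alpha.
    NOT NanoQuant's ADMM procedure -- just an illustration of the
    factorized-binary parameterization the abstract describes."""
    m, n = W.shape
    V = np.where(rng.standard_normal((r, n)) >= 0, 1.0, -1.0)
    U = np.where(W @ V.T >= 0, 1.0, -1.0)
    for _ in range(iters):
        U = np.where(W @ V.T >= 0, 1.0, -1.0)   # sign update for U, V fixed
        V = np.where(U.T @ W >= 0, 1.0, -1.0)   # sign update for V, U fixed
    UV = U @ V
    # Least-squares scale for fixed binary factors; UV*UV sums to m*n > 0.
    alpha = (W * UV).sum() / (UV * UV).sum()
    return alpha, U, V

W = rng.standard_normal((64, 128))
a, B = binarize_rowwise(W)
alpha, U, V = lowrank_binary_sketch(W, r=16)
err = np.linalg.norm(W - alpha * (U @ V)) / np.linalg.norm(W)
```

The factorized form is what enables sub-1-bit rates: storing binary $U$ and $V$ here costs $(64 \cdot 16 + 16 \cdot 128)$ bits for $64 \cdot 128$ weights, i.e. 0.375 bits per weight (plus scales), versus 1 bit per weight for plain binarization.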