Multi-Bitwidth Quantization for LLMs Using Additive Codebooks

As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time precision control over LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. Drop-by-Drop produces a single model where ordered subsets of codebooks yield accurate partial reconstructions at each precision level. This approach significantly reduces storage and memory overhead by allowing a single checkpoint to serve multiple bitwidths, while maintaining competitive perplexity and accuracy across major architectures, such as Qwen, LLaMA, Gemma, and Mistral.

翻译：随着大语言模型（LLM）在具有不同资源约束的异构硬件上日益广泛部署，在无需重新训练的情况下自适应地管理性能与效率之间权衡的能力变得至关重要。我们提出Drop-by-Drop，一种新颖的多比特宽度训练后量化框架，该框架能够从单个训练模型中实现对LLM权重的推理时精度控制。我们的方法在信息论和逐次细化理论中具有理论基础。我们证明，在由LLM损失函数驱动的加权均方误差失真度量下，通常服从高斯分布的LLM权重可以在增加额外比特时以递增保真度最优重建。为了在实践中实现这一点，Drop-by-Drop将马特罗什卡式监督纳入损失函数，充分利用加性码本的结构。Drop-by-Drop生成单个模型，其中有序码本子集在每个精度级别产生精确的部分重建。这种方法通过允许单个检查点服务于多个比特宽度，显著降低了存储和内存开销，同时在Qwen、LLaMA、Gemma和Mistral等主流架构上保持有竞争力的困惑度和准确率。

相关内容

MoDELS

关注 46

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

大语言模型智能体（LLM Agents）工具调用的演进：从单工具调用到多工具协同编排

专知会员服务

29+阅读 · 4月6日

【新书】设计大型语言模型应用：一种面向LLMs的整体方法

专知会员服务

56+阅读 · 2025年3月16日

带入您自己的知识：大型语言模型（LLM）知识扩展方法综述

专知会员服务

38+阅读 · 2025年2月21日

使用多模态大语言模型进行深度学习的图像、文本和语音数据增强：综述

专知会员服务

28+阅读 · 2025年2月4日