As large language models (LLMs) are increasingly deployed across heterogeneous hardware with varying resource constraints, the ability to adaptively manage the trade-off between performance and efficiency without retraining is critical. We propose Drop-by-Drop, a novel multi-bitwidth post-training quantization framework that enables inference-time control over the precision of LLM weights from a single trained model. Our method is theoretically grounded in information theory and successive refinement. We establish that LLM weights, which commonly follow a Gaussian distribution, can be optimally reconstructed with increasing fidelity as additional bits are incorporated, under a weighted mean squared error distortion motivated by LLM loss functions. To realize this in practice, Drop-by-Drop incorporates Matryoshka-style supervision into the loss function, exploiting the structure of additive codebooks. The result is a single model in which ordered subsets of codebooks yield accurate partial reconstructions at each precision level. Because a single checkpoint serves multiple bitwidths, this approach significantly reduces storage and memory overhead relative to keeping one checkpoint per bitwidth, while maintaining competitive perplexity and accuracy across major architectures such as Qwen, LLaMA, Gemma, and Mistral.
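To make the idea of ordered codebook subsets concrete, the following is a minimal sketch, not the Drop-by-Drop training procedure. It assumes synthetic Gaussian weights split into groups of 8, and it fits the codebooks greedily on residuals with a few k-means iterations, which is a stand-in for the Matryoshka-style loss and the weighted distortion described above. All function names (`fit_additive_codebooks`, `reconstruct`) and sizes are hypothetical choices for illustration.

```python
import numpy as np


def nearest(x, centroids):
    """Index of the closest centroid (squared Euclidean distance) for each row of x."""
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)


def kmeans(x, k, iters, rng):
    """Plain Lloyd iterations; empty clusters keep their previous centroid."""
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        codes = nearest(x, centroids)
        for j in range(k):
            members = x[codes == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return centroids


def fit_additive_codebooks(groups, num_codebooks, k, iters=10, seed=0):
    """Greedy residual fit (an assumption of this sketch): codebook m encodes
    whatever the first m-1 codebooks left unexplained."""
    rng = np.random.default_rng(seed)
    residual = groups.copy()
    codebooks, codes = [], []
    for _ in range(num_codebooks):
        codebook = kmeans(residual, k, iters, rng)
        idx = nearest(residual, codebook)
        residual = residual - codebook[idx]
        codebooks.append(codebook)
        codes.append(idx)
    return codebooks, codes


def reconstruct(codebooks, codes, m):
    """Partial reconstruction: sum the codewords from the first m codebooks only."""
    return sum(cb[idx] for cb, idx in zip(codebooks[:m], codes[:m]))


# Synthetic stand-in for a weight matrix: 2048 groups of 8 Gaussian weights.
rng = np.random.default_rng(42)
groups = rng.standard_normal((2048, 8))

# 4 codebooks of 16 entries: each adds 4 bits per group, i.e. 0.5 bits per weight.
codebooks, codes = fit_additive_codebooks(groups, num_codebooks=4, k=16)

mse_by_level = [
    float(np.mean((groups - reconstruct(codebooks, codes, m)) ** 2))
    for m in range(1, len(codebooks) + 1)
]
for m, mse in enumerate(mse_by_level, start=1):
    print(f"first {m} codebook(s), {0.5 * m:.1f} bits/weight: MSE = {mse:.4f}")
```

Dropping trailing codebooks at load time lowers the bitwidth without touching the stored codes for the remaining ones, which is the property that lets one checkpoint serve several precision levels. In this greedy sketch only the residual ordering makes each prefix a usable reconstruction; the method described above instead supervises the partial reconstructions in the loss.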