The deployment of large language models (LLMs) is frequently hindered by prohibitive memory and computational requirements. While quantization mitigates these bottlenecks, maintaining model fidelity in the sub-1-bit regime remains a persistent challenge. In this paper, we introduce LittleBit, a novel framework for extreme LLM compression. We target quantization rates as low as $0.1$ bits per weight (BPW), achieving a memory reduction of approximately $31\times$, which effectively compresses Llama2-13B to under $0.9$ GB. We represent weights via low-rank latent matrix factorization and subsequently binarize the resulting factors. To counteract the information loss inherent to such drastic precision reduction, we integrate a multi-scale compensation mechanism that learns importance parameters across row, column, and latent dimensions. Two primary contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and Residual Compensation to minimize approximation errors. Extensive experiments confirm the superiority of LittleBit in the sub-1-bit domain; for instance, our method at $0.1$ BPW surpasses the performance of leading techniques operating at $0.7$ BPW on Llama2-7B. We establish a new size-performance trade-off -- unlocking a potential $11.6\times$ inference speedup relative to FP16 -- and render powerful LLMs practical for resource-constrained environments. Our code is available at https://github.com/SamsungLabs/LittleBit.
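The representation the abstract describes (binarized low-rank factors plus importance scales over row, column, and latent dimensions) can be sketched as follows. This is an illustration of the general idea only, not the authors' implementation: the factor names, the use of mean absolute values as scales, and the FP16 storage assumption for scales are all assumptions.

```python
import numpy as np

# Illustrative sketch of the weight representation described in the abstract:
# a low-rank factorization whose factors are binarized, compensated by learned
# importance scales over the row, column, and latent dimensions. All names
# (U, V, s_row, s_col, s_lat) and initializations here are assumptions.

rng = np.random.default_rng(0)
m, n, rank = 64, 128, 4          # toy layer; real LLM layers are far larger

W = rng.standard_normal((m, n))  # full-precision weight to approximate

# Low-rank latent factors (in the paper, initialized via Dual-SVID, then
# refined with quantization-aware training).
U = rng.standard_normal((m, rank))
V = rng.standard_normal((n, rank))

# Binarize the factors: only their signs are stored (1 bit per entry).
Ub, Vb = np.sign(U), np.sign(V)

# Multi-scale compensation: importance parameters per row, column, latent dim
# (here crudely set from weight magnitudes; the paper learns them).
s_row = np.abs(W).mean(axis=1)   # shape (m,)
s_col = np.abs(W).mean(axis=0)   # shape (n,)
s_lat = np.ones(rank) / rank     # shape (rank,)

# Reconstruction: diag(s_row) @ sign(U) @ diag(s_lat) @ sign(V).T @ diag(s_col)
W_hat = (s_row[:, None] * Ub) @ np.diag(s_lat) @ (Vb.T * s_col[None, :])
assert W_hat.shape == W.shape

# Effective bits per weight: 1-bit factor entries plus FP16 scale vectors.
bits = rank * (m + n) + 16 * (m + n + rank)
print(f"BPW ~ {bits / (m * n):.3f}")   # shrinks toward rank*(m+n)/(m*n) at scale
```

The bit-rate arithmetic at the end shows why a small latent rank drives BPW below 1: the dominant storage cost, `rank * (m + n)` single-bit entries, is tiny relative to the `m * n` weights it reconstructs.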