Large language models (LLMs) have revolutionized numerous applications, yet their deployment remains challenged by memory constraints on local devices. While scaling laws have enhanced LLM capabilities, the primary bottleneck has shifted from \textit{capability} to \textit{availability}, emphasizing the need for efficient memory management. Traditional compression methods, such as quantization, often require predefined compression ratios and separate compression processes for each setting, complicating deployment in variable memory environments. In this paper, we introduce \textbf{BitStack}, a novel, training-free weight compression approach that enables megabyte-level trade-offs between memory usage and model performance. By leveraging weight decomposition, BitStack can dynamically adjust the model size with minimal transmission between running memory and storage devices. Our approach iteratively decomposes weight matrices while considering the significance of each parameter, resulting in an approximately 1-bit per parameter residual block in each decomposition iteration. These blocks are sorted and stacked in storage as basic transmission units, with different quantities loaded based on current memory availability. Extensive experiments across a wide range of tasks demonstrate that, despite offering fine-grained size control, BitStack consistently matches or surpasses strong quantization baselines, particularly at extreme compression ratios. To the best of our knowledge, this is the first decomposition-based method that effectively bridges the gap to practical compression techniques like quantization. Code is available at \url{https://github.com/xinghaow99/BitStack}.
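The iterative decomposition described above can be sketched roughly as follows. This is a minimal illustrative sketch, not the actual BitStack implementation: the sign-plus-low-rank factorization, the function names, and the omission of parameter-significance weighting are all assumptions made for illustration. The key idea it demonstrates is that each iteration stores a sign matrix (about 1 bit per parameter, plus small low-rank factors) approximating the current residual, so loading more blocks monotonically improves the reconstruction.

```python
import numpy as np


def bitstack_decompose(W, num_iters=4, rank=1):
    """Illustrative sketch: iteratively approximate the residual of W with
    sign(residual) * (rank-r approximation of |residual|). Each block costs
    ~1 bit/param (the packed sign bits) plus small low-rank factors."""
    residual = W.astype(np.float64)
    blocks = []
    for _ in range(num_iters):
        sign = np.sign(residual)
        sign[sign == 0] = 1.0
        # Best rank-r approximation of the magnitude matrix via SVD.
        U, S, Vt = np.linalg.svd(np.abs(residual), full_matrices=False)
        u = U[:, :rank] * S[:rank]          # (m, r), singular values folded in
        vt = Vt[:rank, :]                   # (r, n)
        # Store signs packed to 1 bit each, plus the low-rank factors.
        blocks.append((np.packbits(sign > 0), u, vt))
        residual = residual - sign * (u @ vt)
    return blocks


def reconstruct(blocks, shape):
    """Sum up however many blocks fit in memory; more blocks -> lower error."""
    W_hat = np.zeros(shape)
    n_elem = shape[0] * shape[1]
    for packed, u, vt in blocks:
        bits = np.unpackbits(packed)[:n_elem].reshape(shape)
        sign = bits.astype(np.float64) * 2.0 - 1.0  # {0,1} -> {-1,+1}
        W_hat += sign * (u @ vt)
    return W_hat
```

A usage sketch: decompose once, then reconstruct from a prefix of the block list whose length matches the currently available memory; the unused blocks stay on storage until budget allows loading them.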