Sparse plus Low-Rank $(\mathbf{S} + \mathbf{LR})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $(\mathbf{W} \approx \mathbf{S} + \mathbf{LR})$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{LR})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design an efficient transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal, as it can enhance any $(\mathbf{S} + \mathbf{LR})$ decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA $(\mathbf{S} + \mathbf{LR})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
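To make the $\mathbf{W} \approx \mathbf{S} + \mathbf{LR}$ objective concrete, the sketch below shows a generic alternating scheme for splitting a weight matrix into a sparse part (magnitude thresholding) and a low-rank part (truncated SVD of the residual). This is only an illustration of the decomposition being targeted; it is not the paper's 3BASiL ADMM algorithm or its TM refinement, and the function name, sparsity level, and rank are illustrative choices.

```python
import numpy as np

def sparse_plus_lowrank(W, keep_frac=0.3, rank=8, iters=5):
    """Generic alternating sketch of W ~ S + LR.

    NOT the paper's 3-block ADMM (3BASiL); just a simple baseline:
    alternate magnitude-based sparsification and truncated SVD.
    """
    LR = np.zeros_like(W)
    for _ in range(iters):
        # Sparse step: keep the largest-magnitude entries of W - LR.
        resid = W - LR
        k = int(keep_frac * resid.size)
        thresh = np.partition(np.abs(resid).ravel(), -k)[-k]
        S = np.where(np.abs(resid) >= thresh, resid, 0.0)
        # Low-rank step: best rank-r approximation of W - S via SVD.
        U, sv, Vt = np.linalg.svd(W - S, full_matrices=False)
        LR = (U[:, :rank] * sv[:rank]) @ Vt[:rank]
    return S, LR

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
S, LR = sparse_plus_lowrank(W)
rel_err = np.linalg.norm(W - S - LR) / np.linalg.norm(W)
```

Methods like the one described above replace this heuristic alternation with a principled layer-wise reconstruction objective (measured against calibration activations rather than the raw weights) and enforce structured patterns such as 2:4 sparsity for hardware speedups.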