Sparse plus Low-Rank $(\mathbf{S} + \mathbf{LR})$ decomposition of Large Language Models (LLMs) has emerged as a promising direction in model compression, aiming to decompose pre-trained model weights into a sum of sparse and low-rank matrices $(\mathbf{W} \approx \mathbf{S} + \mathbf{LR})$. Despite recent progress, existing methods often suffer from substantial performance degradation compared to dense models. In this work, we introduce 3BASiL-TM, an efficient one-shot post-training method for $(\mathbf{S} + \mathbf{LR})$ decomposition of LLMs that addresses this gap. Our approach first introduces a novel 3-Block Alternating Direction Method of Multipliers (ADMM) method, termed 3BASiL, to minimize the layer-wise reconstruction error with convergence guarantees. We then design an efficient transformer-matching (TM) refinement step that jointly optimizes the sparse and low-rank components across transformer layers. This step minimizes a novel memory-efficient loss that aligns outputs at the transformer level. Notably, the TM procedure is universal, as it can enhance any $(\mathbf{S} + \mathbf{LR})$ decomposition, including pure sparsity. Our numerical experiments show that 3BASiL-TM reduces the WikiText2 perplexity gap relative to the dense LLaMA-8B model by over 30% under a (2:4 Sparse + 64 LR) configuration, compared to prior methods. Moreover, our method achieves over 2.5x faster compression runtime on an A100 GPU compared to the SOTA $(\mathbf{S} + \mathbf{LR})$ method. Our code is available at https://github.com/mazumder-lab/3BASiL.
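To make the $\mathbf{W} \approx \mathbf{S} + \mathbf{LR}$ objective concrete, the sketch below shows a generic alternating scheme for splitting a weight matrix into a sparse part (magnitude thresholding) and a low-rank part (truncated SVD of the residual). This is only an illustration of the decomposition being targeted; it is not the paper's 3BASiL ADMM algorithm or its TM refinement, and the function name, sparsity level, and rank are illustrative choices.

```python
import numpy as np

def sparse_plus_lowrank(W, keep_frac=0.3, rank=8, iters=5):
    """Generic alternating sketch of W ~ S + LR.

    NOT the paper's 3-block ADMM (3BASiL); just a simple baseline:
    alternate magnitude-based sparsification and truncated SVD.
    """
    LR = np.zeros_like(W)
    for _ in range(iters):
        # Sparse step: keep the largest-magnitude entries of W - LR.
        resid = W - LR
        k = int(keep_frac * resid.size)
        thresh = np.partition(np.abs(resid).ravel(), -k)[-k]
        S = np.where(np.abs(resid) >= thresh, resid, 0.0)
        # Low-rank step: best rank-r approximation of W - S via SVD.
        U, sv, Vt = np.linalg.svd(W - S, full_matrices=False)
        LR = (U[:, :rank] * sv[:rank]) @ Vt[:rank]
    return S, LR

rng = np.random.default_rng(0)
W = rng.standard_normal((64, 64))
S, LR = sparse_plus_lowrank(W)
rel_err = np.linalg.norm(W - S - LR) / np.linalg.norm(W)
```

Methods like the one described above replace this heuristic alternation with a principled layer-wise reconstruction objective (measured against calibration activations rather than the raw weights) and enforce structured patterns such as 2:4 sparsity for hardware speedups.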