Adapting large pre-trained language models to downstream tasks often entails fine-tuning millions of parameters or deploying costly dense weight updates, which hinders their use in resource-constrained environments. Low-Rank Adaptation (LoRA) reduces the number of trainable parameters by factorizing weight updates, yet the underlying dense weights still impose high storage and computation costs. Magnitude-based pruning can yield sparse models but typically degrades LoRA's performance when applied naively. In this paper, we introduce SALR (Sparsity-Aware Low-Rank Representation), a novel fine-tuning paradigm that unifies low-rank adaptation with sparse pruning under a rigorous mean-squared-error framework. We prove that statically pruning only the frozen base weights minimizes the pruning error bound, and we recover the discarded residual information via a truncated-SVD low-rank adapter, which provably reduces the per-entry MSE by a factor of $(1 - r/\min(d,k))$. To maximize hardware efficiency, we fuse multiple low-rank adapters into a single concatenated GEMM, and we adopt a bitmap-based sparse encoding with a two-stage pipelined decoding-plus-GEMM design to achieve true model compression and speedup. Empirically, SALR attains 50\% sparsity on various LLMs while matching the performance of LoRA on GSM8K and MMLU, reduces model size by $2\times$, and delivers up to a $1.7\times$ inference speedup.
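The core mechanism can be sketched numerically: magnitude-prune a frozen weight matrix, then approximate the discarded residual with a rank-$r$ truncated SVD that plays the role of the low-rank adapter. This is a minimal illustrative sketch, not the paper's implementation; the matrix shapes, the 50\% sparsity threshold, and the rank $r$ are assumptions chosen for the demo.

```python
import numpy as np

# Hypothetical dimensions: a d x k weight matrix, rank-r adapter.
rng = np.random.default_rng(0)
d, k, r = 64, 48, 8
W = rng.standard_normal((d, k))

# Magnitude pruning at 50% sparsity: zero out the smaller-|w| half.
thresh = np.median(np.abs(W))
W_pruned = np.where(np.abs(W) >= thresh, W, 0.0)

# Residual information discarded by pruning.
R = W - W_pruned

# Rank-r truncated SVD of the residual yields the low-rank adapter B @ A,
# the best rank-r approximation of R in Frobenius norm (Eckart-Young).
U, S, Vt = np.linalg.svd(R, full_matrices=False)
B = U[:, :r] * S[:r]   # shape (d, r), columns scaled by singular values
A = Vt[:r, :]          # shape (r, k)

# Per-entry MSE with pruning alone vs. pruning + low-rank recovery.
mse_pruned = np.mean(R ** 2)
mse_adapted = np.mean((R - B @ A) ** 2)
assert mse_adapted < mse_pruned  # the adapter recovers part of the residual
```

At inference, the effective weight is `W_pruned + B @ A`: a sparse matrix plus a rank-$r$ correction, which is what makes the bitmap-encoded sparse GEMM plus fused adapter GEMM decomposition possible.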