Despite the efficacy of network sparsity in alleviating the deployment strain of Large Language Models (LLMs), it incurs significant performance degradation. Applying Low-Rank Adaptation (LoRA) to fine-tune sparse LLMs is an intuitive way to counter this degradation, but it has two shortcomings: 1) the LoRA weights cannot be merged into the sparse LLM after training, and 2) performance recovery is insufficient at high sparsity ratios. In this paper, we introduce dynamic Low-rank Sparse Adaptation (LoSA), a novel method that seamlessly integrates low-rank adaptation into LLM sparsity within a unified framework, enhancing the performance of sparse LLMs without increasing inference latency. In particular, LoSA dynamically sparsifies the LoRA outputs based on the corresponding sparse weights during fine-tuning, guaranteeing that the LoRA module can be merged into the sparse LLM after training. Moreover, LoSA leverages Representation Mutual Information (RMI) as an indicator of layer importance, efficiently determining layer-wise sparsity rates during fine-tuning. Building on this, LoSA adjusts the rank of each LoRA module according to the variability in layer-wise reconstruction errors, allocating an appropriate fine-tuning capacity to each layer to reduce the output discrepancy between the dense and sparse LLMs. Extensive experiments show that LoSA can efficiently boost the efficacy of sparse LLMs within a few hours without introducing any additional inference overhead. For example, LoSA reduces the perplexity of sparse LLaMA-2-7B by 68.73 and increases zero-shot accuracy by 16.32$\%$, achieving a 2.60$\times$ speedup on CPU and a 2.23$\times$ speedup on GPU, while requiring only 45 minutes of fine-tuning on a single NVIDIA A100 80GB GPU. Code is available at https://github.com/wzhuang-xmu/LoSA.
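The mergeability property described above can be illustrated with a minimal sketch (function name, shapes, and the use of NumPy are our own illustrative assumptions, not the paper's implementation): if the low-rank update $BA$ is masked with the same sparsity pattern as the base weight, the merged weight preserves that pattern, so no dense LoRA branch survives into inference.

```python
import numpy as np

def losa_merge(W, A, B, mask):
    """Hypothetical sketch: merge a LoRA update (B @ A) into a sparse weight W.

    W:    (out, in) sparse base weight (zeros where pruned)
    A:    (r, in)   LoRA down-projection
    B:    (out, r)  LoRA up-projection
    mask: (out, in) binary sparsity mask (1 = kept weight)
    """
    delta = B @ A              # dense low-rank update
    delta = delta * mask       # sparsify the update to match W's pattern
    return W + delta           # merged weight keeps W's sparsity pattern

# Toy usage with random data.
rng = np.random.default_rng(0)
out_f, in_f, r = 8, 16, 2
mask = (rng.random((out_f, in_f)) > 0.5).astype(float)
W = rng.standard_normal((out_f, in_f)) * mask  # pruned base weight
A = rng.standard_normal((r, in_f))
B = rng.standard_normal((out_f, r))
W_merged = losa_merge(W, A, B, mask)
# W_merged is zero wherever the mask is zero, so it stays sparse.
```

Because the masked update never reactivates pruned positions, inference runs on a single sparse matrix with no extra latency, in contrast to vanilla LoRA, whose dense $BA$ term cannot be folded into a sparse weight.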