Current PEFT methods for LLMs can achieve high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Building on this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". It selects a small number of heads and channels in the MHA and FFN modules, respectively, for each Transformer block. Next, it co-permutes the weight matrices on both sides of these coupled structures so that the selected components in each layer form a dense submatrix. Finally, S$^{2}$FT performs in-place gradient updates on all such submatrices. Through theoretical analysis and empirical results, we show that our method prevents forgetting while simplifying optimization, delivers state-of-the-art performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements over LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, S$^{2}$FT reduces training memory by up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT, while delivering an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S$^{2}$FT can be decoupled into adapters, enabling effective fusion, fast switching, and efficient parallelism when serving multiple fine-tuned models.
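To make the "selecting sparsely and computing densely" idea concrete, the following is a minimal PyTorch sketch on a toy LLaMA-style FFN, not the authors' implementation: the names `W_up` and `W_down`, the toy dimensions, and the choice of k = 8 selected channels are illustrative assumptions. It co-permutes the rows of the up projection and the columns of the down projection with the same permutation, so the selected channels become one contiguous, dense, trainable slice while the network's output is unchanged.

```python
# A minimal sketch of "select sparsely, compute densely" on a toy LLaMA-style FFN
# y = W_down @ relu(W_up @ x).  All names, shapes, and the channel count k are
# illustrative assumptions, not the authors' implementation.
import torch

d_model, d_ffn, k = 16, 64, 8
torch.manual_seed(0)
W_up = torch.randn(d_ffn, d_model)    # up projection:   d_model -> d_ffn
W_down = torch.randn(d_model, d_ffn)  # down projection: d_ffn   -> d_model

def ffn(x, up, down):
    return down @ torch.relu(up @ x)

# 1) Select sparsely: pick k of the d_ffn intermediate channels.
selected = torch.randperm(d_ffn)[:k]
rest = torch.tensor([i for i in range(d_ffn) if i not in set(selected.tolist())])
perm = torch.cat([rest, selected])    # move the selected channels to the tail

# 2) Co-permute the coupled pair: rows of W_up and columns of W_down with the
#    same permutation, so the FFN computes exactly the same function.
W_up_p, W_down_p = W_up[perm, :], W_down[:, perm]
x = torch.randn(d_model)
assert torch.allclose(ffn(x, W_up, W_down), ffn(x, W_up_p, W_down_p), atol=1e-4)

# 3) Compute densely: only the contiguous tail slice of W_down_p is trainable,
#    so its update is a single dense sub-matrix operation.
frozen = W_down_p[:, :-k]
trainable = W_down_p[:, -k:].clone().requires_grad_(True)
y = torch.cat([frozen, trainable], dim=1) @ torch.relu(W_up_p @ x)
y.sum().backward()
print(trainable.grad.shape)           # torch.Size([16, 8]): the dense submatrix
```

In this form the sparse update touches a single dense submatrix, which is what lets a partial backpropagation pass skip gradients for the frozen columns and yields the memory and latency savings reported above.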