Current PEFT methods for LLMs can achieve high quality, efficient training, or scalable serving, but not all three simultaneously. To address this limitation, we investigate sparse fine-tuning and observe a remarkable improvement in generalization ability. Building on this key insight, we propose a family of Structured Sparse Fine-Tuning (S$^{2}$FT) methods for LLMs, which concurrently achieve state-of-the-art fine-tuning performance, training efficiency, and inference scalability. S$^{2}$FT accomplishes this by "selecting sparsely and computing densely". It selects a small number of heads and channels in the MHA and FFN modules, respectively, for each Transformer block. Next, it co-permutes the weight matrices on both sides of these coupled structures so that the selected components in each layer form a dense submatrix. Finally, S$^{2}$FT performs in-place gradient updates on all such submatrices. Through theoretical analysis and empirical results, we show that our method prevents forgetting while simplifying optimization, delivers state-of-the-art performance on both commonsense and arithmetic reasoning with 4.6% and 1.3% average improvements over LoRA, and surpasses full FT by 11.5% when generalizing to various domains after instruction tuning. Using our partial backpropagation algorithm, S$^{2}$FT reduces training memory by up to 3$\times$ and improves latency by 1.5-2.7$\times$ compared to full FT, while delivering an average 10% improvement over LoRA on both metrics. We further demonstrate that the weight updates in S$^{2}$FT can be decoupled into adapters, enabling effective fusion, fast switching, and efficient parallelism when serving multiple fine-tuned models.
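To make the "selecting sparsely and computing densely" idea concrete, the following is a minimal PyTorch sketch on a toy LLaMA-style FFN, not the authors' implementation: the names `W_up` and `W_down`, the toy dimensions, and the choice of k = 8 selected channels are illustrative assumptions. It co-permutes the rows of the up projection and the columns of the down projection with the same permutation, so the selected channels become one contiguous, dense, trainable slice while the network's output is unchanged.

```python
# A minimal sketch of "select sparsely, compute densely" on a toy LLaMA-style FFN
# y = W_down @ relu(W_up @ x).  All names, shapes, and the channel count k are
# illustrative assumptions, not the authors' implementation.
import torch

d_model, d_ffn, k = 16, 64, 8
torch.manual_seed(0)
W_up = torch.randn(d_ffn, d_model)    # up projection:   d_model -> d_ffn
W_down = torch.randn(d_model, d_ffn)  # down projection: d_ffn   -> d_model

def ffn(x, up, down):
    return down @ torch.relu(up @ x)

# 1) Select sparsely: pick k of the d_ffn intermediate channels.
selected = torch.randperm(d_ffn)[:k]
rest = torch.tensor([i for i in range(d_ffn) if i not in set(selected.tolist())])
perm = torch.cat([rest, selected])    # move the selected channels to the tail

# 2) Co-permute the coupled pair: rows of W_up and columns of W_down with the
#    same permutation, so the FFN computes exactly the same function.
W_up_p, W_down_p = W_up[perm, :], W_down[:, perm]
x = torch.randn(d_model)
assert torch.allclose(ffn(x, W_up, W_down), ffn(x, W_up_p, W_down_p), atol=1e-4)

# 3) Compute densely: only the contiguous tail slice of W_down_p is trainable,
#    so its update is a single dense sub-matrix operation.
frozen = W_down_p[:, :-k]
trainable = W_down_p[:, -k:].clone().requires_grad_(True)
y = torch.cat([frozen, trainable], dim=1) @ torch.relu(W_up_p @ x)
y.sum().backward()
print(trainable.grad.shape)           # torch.Size([16, 8]): the dense submatrix
```

In this form the sparse update touches a single dense submatrix, which is what lets a partial backpropagation pass skip gradients for the frozen columns and yields the memory and latency savings reported above.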