With the prevalence of the pre-training and fine-tuning paradigm, how to efficiently adapt a pre-trained model to downstream tasks has become an important question. Parameter-Efficient Fine-Tuning (PEFT) methods have been proposed for low-cost adaptation, including Adapters, Bias-only tuning, and the recently widely used Low-Rank Adaptation (LoRA). Although these methods have proven effective to some extent and are widely applied, their underlying principles remain unclear. In this paper, we reveal how the loss landscape in the downstream domain changes when moving from random initialization to pre-trained initialization, namely from low-amplitude oscillation to high-amplitude oscillation. The parameter gradients exhibit a property akin to sparsity, in which a small fraction of components dominates the total gradient norm; for instance, 1% of the components account for 99% of the gradient norm. This property ensures that the pre-trained model can easily find a flat minimizer, which guarantees the model's ability to generalize even with a small number of trainable parameters. Based on this, we propose a gradient-based sparse fine-tuning algorithm, named Sparse Increment Fine-Tuning (SIFT), and validate its effectiveness on a range of tasks, including the GLUE benchmark and instruction tuning. The code is available at https://github.com/song-wx/SIFT/.
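To make the quasi-sparsity claim concrete, the following is a minimal NumPy sketch of how one might measure the share of the gradient norm carried by the largest components, and then restrict an update to those components. It is not the SIFT implementation from the repository above. The function names (`top_fraction_norm_share`, `topk_mask`, `sparse_sgd_step`), the use of the squared L2 norm, the plain SGD update, and the synthetic gradient are all assumptions made for illustration.

```python
import numpy as np


def top_fraction_norm_share(grad, fraction=0.01):
    """Share of the squared L2 norm held by the largest-magnitude components.

    Assumption: "gradient norm" is read here as the squared L2 norm.
    """
    flat = np.abs(np.ravel(grad))
    k = max(1, int(round(fraction * flat.size)))
    # np.partition moves the k largest values into the last k slots (unordered).
    top = np.partition(flat, flat.size - k)[-k:]
    return float(np.sum(top ** 2) / np.sum(flat ** 2))


def topk_mask(grad, fraction=0.01):
    """Boolean mask selecting the largest-magnitude `fraction` of components."""
    flat = np.abs(np.ravel(grad))
    k = max(1, int(round(fraction * flat.size)))
    idx = np.argpartition(flat, flat.size - k)[-k:]
    mask = np.zeros(flat.size, dtype=bool)
    mask[idx] = True
    return mask.reshape(np.shape(grad))


def sparse_sgd_step(params, grad, mask, lr=0.1):
    """Hypothetical sparse update: only masked components change."""
    return params - lr * np.where(mask, grad, 0.0)


# Synthetic, hand-built gradient (not measured data): 100 large components
# out of 10,000, with all remaining components close to zero.
g = np.full(10_000, 1e-3)
g[:100] = 1.0

share = top_fraction_norm_share(g, fraction=0.01)
mask = topk_mask(g, fraction=0.01)
updated = sparse_sgd_step(np.zeros_like(g), g, mask, lr=0.1)

print(f"share of squared norm in top 1%: {share:.4f}")
print(f"components updated: {np.count_nonzero(updated)} of {updated.size}")
```

In this sketch the mask is computed once from a single gradient and then reused. Whether the mask is fixed or refreshed during training, and which optimizer is used, are design choices the abstract does not specify.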