Fine-tuning pre-trained large language models in a parameter-efficient manner is widely studied for its effectiveness and efficiency. The popular low-rank adaptation (LoRA) method rests on the hypothesis that the adaptation process is intrinsically low-dimensional. Although LoRA performs well, it is implemented with a fixed intrinsic rank that might not always be the ideal choice. Recognizing the need for more flexible adaptation, we extend LoRA to an approach we call sparse low-rank adaptation (SoRA), which adjusts the intrinsic rank dynamically during adaptation. We achieve this by incorporating a gate unit that is optimized with the proximal gradient method in the training stage, so that the sparsity of the gate controls the cardinality of the rank. In the subsequent inference stage, we eliminate the parameter blocks corresponding to the zeroed-out ranks, reducing each SoRA module back to a concise yet rank-optimal LoRA. Our approach strengthens the representation power of LoRA by initializing it with a higher rank, while efficiently taming the temporarily increased number of parameters by updating them in a sparse way. We further introduce a sparsifying scheduler for SoRA to examine the impact of the number of non-zero parameters on the model's memorization and generalization. Our experimental results demonstrate that SoRA can outperform other baselines while retaining only 70% of the parameters and using 70% of the training time.
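The gate-then-prune mechanism described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes the sparsity penalty on the gate is an L1 norm (whose proximal operator is element-wise soft-thresholding), it uses a placeholder zero gradient instead of a real training loss, and every name in it (`soft_threshold`, `proximal_gate_step`, `gated_update`, `prune_to_lora`, `w_down`, `w_up`, `l1_strength`) is hypothetical.

```python
import numpy as np


def soft_threshold(g, threshold):
    """Proximal operator of threshold * ||g||_1 (element-wise soft-thresholding)."""
    return np.sign(g) * np.maximum(np.abs(g) - threshold, 0.0)


def proximal_gate_step(gate, grad, lr, l1_strength):
    """One proximal gradient update: a plain gradient step, then the L1 prox.

    Assumption: the gate is regularized with an L1 penalty of weight l1_strength.
    """
    return soft_threshold(gate - lr * grad, lr * l1_strength)


def gated_update(x, w_down, gate, w_up):
    """Gated low-rank update applied to x: w_up @ (gate * (w_down @ x))."""
    return w_up @ (gate * (w_down @ x))


def prune_to_lora(w_down, gate, w_up):
    """Drop ranks whose gate is exactly zero; fold the surviving gate into w_down."""
    keep = np.flatnonzero(gate)
    return gate[keep, None] * w_down[keep], w_up[:, keep]


rng = np.random.default_rng(0)
d_in, d_out, r_max = 6, 5, 4          # illustrative sizes only
w_down = rng.normal(size=(r_max, d_in))
w_up = rng.normal(size=(d_out, r_max))
gate = np.array([0.9, 0.05, -0.6, 0.02])
grad = np.zeros(r_max)                # placeholder gradient, not a real loss

# With lr * l1_strength = 0.1, the two small gate entries are zeroed out.
gate = proximal_gate_step(gate, grad, lr=0.1, l1_strength=1.0)

# The pruned pair (a, b) is an ordinary rank-2 LoRA factorization.
a, b = prune_to_lora(w_down, gate, w_up)
x = rng.normal(size=d_in)
```

In this sketch the pruned factors give the same output as the gated module, `b @ (a @ x)` versus `gated_update(x, w_down, gate, w_up)`, because the removed ranks contributed exactly zero. Folding the gate into `w_down` rather than `w_up` is an arbitrary choice made here for brevity.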