Fine-tuning large-scale pre-trained models with limited data presents significant challenges for generalization. While Sharpness-Aware Minimization (SAM) has proven effective in improving generalization by seeking flat minima, its substantial memory and computation overhead makes it impractical for large models. Integrating SAM with parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) is a promising direction. However, we find that directly applying SAM to LoRA parameters confines the sharpness optimization to a restricted subspace, which hinders its effectiveness. To address this limitation, we propose Bi-directional Low-Rank Adaptation (Bi-LoRA), which introduces an auxiliary LoRA module to model SAM's adversarial weight perturbations. Bi-LoRA decouples SAM's weight perturbations from LoRA optimization: the primary LoRA module adapts to specific tasks via standard gradient descent, while the auxiliary module captures the sharpness of the loss landscape through gradient ascent. This dual-module design enables Bi-LoRA to capture sharpness beyond the restricted subspace and reach flatter minima while remaining memory-efficient. A further benefit is that the dual design allows optimization and perturbation to proceed simultaneously, eliminating SAM's doubled training cost. Extensive experiments across diverse tasks and architectures demonstrate Bi-LoRA's efficiency and effectiveness in enhancing generalization.
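To make the dual-module idea concrete, the following is a minimal NumPy sketch of one simultaneous update step on a toy linear regression layer. It is an illustration under stated assumptions, not the authors' implementation: the function name `bi_lora_step`, the squared-error loss, the learning rates, the radius `rho`, and the rescaling that keeps the auxiliary perturbation inside a Frobenius-norm ball are all choices made here for illustration and are not specified in the abstract.

```python
import numpy as np

def bi_lora_step(W0, primary, auxiliary, x, y, lr=0.1, aux_lr=0.1, rho=0.05):
    """One simultaneous update (illustrative sketch only).

    primary   = (B1, A1): task adaptation, updated by gradient descent.
    auxiliary = (B2, A2): models the adversarial perturbation, updated by
                gradient ascent and kept within an assumed radius rho.
    """
    B1, A1 = primary
    B2, A2 = auxiliary

    # Effective weight: frozen base + task adaptation + modeled perturbation.
    W = W0 + B1 @ A1 + B2 @ A2
    residual = x @ W.T - y                      # shape (n, d_out)
    loss = 0.5 * np.mean(np.sum(residual ** 2, axis=1))

    # A single gradient w.r.t. W serves both modules (one backward pass).
    G = residual.T @ x / x.shape[0]             # shape (d_out, d_in)

    # Chain rule through each low-rank factorization.
    gB1, gA1 = G @ A1.T, B1.T @ G
    gB2, gA2 = G @ A2.T, B2.T @ G

    # Primary module: descent. Auxiliary module: ascent.
    B1, A1 = B1 - lr * gB1, A1 - lr * gA1
    B2, A2 = B2 + aux_lr * gB2, A2 + aux_lr * gA2

    # Assumption: bound the perturbation B2 @ A2 by Frobenius norm rho.
    norm = np.linalg.norm(B2 @ A2)
    if norm > rho:
        scale = np.sqrt(rho / norm)             # split the rescaling evenly
        B2, A2 = B2 * scale, A2 * scale

    return loss, (B1, A1), (B2, A2)

# Toy data: the target differs from the frozen weight by a small shift.
rng = np.random.default_rng(0)
d_in, d_out, rank, n = 8, 4, 2, 32
W0 = rng.normal(size=(d_out, d_in))
x = rng.normal(size=(n, d_in))
y = x @ (W0 + 0.1 * rng.normal(size=(d_out, d_in))).T

# LoRA-style initialization: B starts at zero, A is small and random.
primary = (np.zeros((d_out, rank)), 0.1 * rng.normal(size=(rank, d_in)))
auxiliary = (np.zeros((d_out, rank)), 0.1 * rng.normal(size=(rank, d_in)))

losses = []
for _ in range(200):
    loss, primary, auxiliary = bi_lora_step(W0, primary, auxiliary, x, y)
    losses.append(loss)
```

In this sketch both modules are updated from the same gradient `G`, which is what allows perturbation and optimization to happen in a single forward and backward pass rather than the two passes SAM requires. Because the auxiliary product `B2 @ A2` is a separate low-rank matrix, the perturbation is not restricted to the directions spanned by the primary factors.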