As pre-trained models rapidly grow larger, the cost of fine-tuning them on downstream tasks steadily increases as well. To fine-tune these models economically, parameter-efficient transfer learning (PETL) has been proposed, which tunes only a tiny subset of trainable parameters to efficiently learn quality representations. However, current PETL methods face a dilemma: during training, the GPU memory footprint is not reduced as effectively as the number of trainable parameters. Consequently, if full fine-tuning runs out of GPU memory, PETL will likely fail too. This happens because the trainable parameters in these methods are generally entangled with the backbone, so many intermediate states have to be stored in GPU memory for gradient propagation. To alleviate this problem, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). By progressively extracting task-specific information with a few low-rank linear mappings and appropriately adding the information back to the backbone, CSN effectively realizes knowledge transfer in various downstream tasks. We conduct extensive experiments to validate the effectiveness of our method. The proposed method not only substantially reduces GPU memory usage and the number of trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy, achieving new state-of-the-art results on several standard benchmarks. The code is available at https://github.com/heekhero/DTL.
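To make the idea of a side network built from low-rank linear mappings more concrete, the following is a minimal pure-Python sketch, not the authors' implementation (see the repository linked above for that). All names here (`matvec`, `low_rank_map`, `low_rank_param_count`, `forward`) are hypothetical, the backbone blocks are stand-in functions with no trainable weights, and adding the side features back only once at the end is a simplification assumed for illustration. The dimensions in the usage note (width 768, 12 blocks, rank 2) are likewise assumed example values, not figures from the abstract.

```python
def matvec(matrix, vec):
    """Multiply a matrix (stored as a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def low_rank_map(down, up, vec):
    """Apply a rank-r mapping: project d -> r with `down`, then r -> d with `up`."""
    return matvec(up, matvec(down, vec))


def low_rank_param_count(d, r, num_blocks):
    """Trainable weights if every block gets one d -> r -> d mapping (no biases)."""
    return num_blocks * 2 * d * r


def forward(x, backbone_blocks, side_maps):
    """Run frozen backbone blocks and accumulate side features in parallel.

    backbone_blocks: list of functions (assumed frozen, i.e. never updated).
    side_maps: list of (down, up) matrix pairs, one per block; in this
        sketch these are the only parameters that would be trained.
    """
    z = x
    h = [0.0] * len(x)  # side-network state, kept separate from the backbone
    for block, (down, up) in zip(backbone_blocks, side_maps):
        z = block(z)  # backbone output; nothing inside the block is trained
        update = low_rank_map(down, up, z)  # extract task-specific information
        h = [hi + ui for hi, ui in zip(h, update)]
    # Simplifying assumption: add the side features back once, at the end.
    return [zi + hi for zi, hi in zip(z, h)]
```

Under these assumptions, `low_rank_param_count(768, 2, 12)` gives 36,864 trainable weights in total, compared with 768 × 768 = 589,824 for a single full linear layer of the same width. Because the side state `h` only reads the backbone outputs and never feeds into a backbone block in this sketch, gradients for the `(down, up)` pairs would not need the intermediate states inside the blocks, which is the kind of disentanglement the abstract describes.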