As pre-trained models grow rapidly in size, the cost of fine-tuning them on downstream tasks steadily increases as well. To fine-tune these models economically, parameter-efficient transfer learning (PETL) has been proposed, which tunes only a tiny subset of trainable parameters to efficiently learn quality representations. However, current PETL methods face a dilemma: during training, the GPU memory footprint is not reduced as effectively as the number of trainable parameters. As a result, if full fine-tuning runs out of GPU memory, PETL will likely fail, too. This happens because the trainable parameters in these methods are generally entangled with the backbone, so many intermediate states have to be stored in GPU memory for gradient propagation. To alleviate this problem, we introduce Disentangled Transfer Learning (DTL), which disentangles the trainable parameters from the backbone using a lightweight Compact Side Network (CSN). By progressively extracting task-specific information with a few low-rank linear mappings and appropriately adding the information back to the backbone, CSN effectively realizes knowledge transfer in various downstream tasks. We conducted extensive experiments to validate the effectiveness of our method. The proposed method not only substantially reduces GPU memory usage and the number of trainable parameters, but also outperforms existing PETL methods by a significant margin in accuracy, achieving new state-of-the-art results on several standard benchmarks.
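To make the data flow of a side network concrete, the following is a minimal forward-pass sketch in plain Python. It is an illustration under stated assumptions, not the implementation evaluated in this work: the names `FrozenBlock`, `CompactSideNetwork`, `forward`, and `add_back_from` are hypothetical, the backbone block is a toy `tanh` layer, and the accumulation rule, the zero initialization of the up-projection, and the choice to add information back only in later blocks are assumptions made for the example. No gradients are computed here; the sketch only shows that all trainable weights live in low-rank pairs outside the frozen blocks.

```python
import math
import random


def matmul(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


class FrozenBlock:
    """Toy stand-in for one frozen backbone block: tanh(x @ W). Never updated."""

    def __init__(self, dim, rng):
        self.w = rand_matrix(dim, dim, rng, scale=0.5)

    def __call__(self, x):
        return [math.tanh(v) for v in matmul(x, self.w)]


class CompactSideNetwork:
    """Hypothetical side network: one low-rank pair (down, up) per backbone block."""

    def __init__(self, num_blocks, dim, rank, rng):
        self.down = [rand_matrix(dim, rank, rng) for _ in range(num_blocks)]
        # Assumption: zero-initialized up-projections, so the side network
        # initially contributes nothing and the backbone output is unchanged.
        self.up = [zeros(rank, dim) for _ in range(num_blocks)]

    def num_params(self):
        return sum(len(m) * len(m[0]) for m in self.down + self.up)

    def extract(self, i, z):
        """Low-rank mapping of block i's output: (z @ down_i) @ up_i."""
        return matmul(matmul(z, self.down[i]), self.up[i])


def forward(blocks, side, x, add_back_from):
    """Run the frozen backbone while accumulating task-specific information."""
    z = x
    w = [0.0] * len(x)  # accumulated side-network state
    for i, block in enumerate(blocks):
        z = block(z)
        w = [a + b for a, b in zip(w, side.extract(i, z))]
        if i >= add_back_from:  # assumption: add back only in later blocks
            z = [a + b for a, b in zip(z, w)]
    return z


rng = random.Random(0)
dim, rank, num_blocks = 16, 2, 4
blocks = [FrozenBlock(dim, rng) for _ in range(num_blocks)]
side = CompactSideNetwork(num_blocks, dim, rank, rng)
x = [rng.uniform(-1, 1) for _ in range(dim)]
out = forward(blocks, side, x, add_back_from=2)
print("backbone weights:", num_blocks * dim * dim)
print("side-network weights:", side.num_params())
```

In this toy setting the side network holds `2 * dim * rank` weights per block, compared with `dim * dim` for each frozen block. Because the blocks before `add_back_from` never receive the side-network state, their outputs can be treated as constants with respect to the trainable weights, which is the intuition behind the memory saving described above.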