Fine-tuning pre-trained models has emerged as a powerful technique in numerous domains, owing to its ability to leverage extensive pre-existing knowledge and achieve strong performance on downstream tasks. However, updating the parameters of an entire network is computationally intensive. Although state-of-the-art parameter-efficient transfer learning (PETL) methods significantly reduce the number of trainable parameters and the storage demand, almost all of them still need to back-propagate gradients through the large pre-trained network. This memory-intensive characteristic severely limits the applicability of PETL methods in real-world scenarios. To this end, we propose a new memory-efficient PETL strategy, dubbed Universal Parallel Tuning (UniPT). Specifically, we facilitate the transfer process via a lightweight learnable parallel network, which consists of two modules: 1) a parallel interaction module that decouples the inherently sequential connections and processes the intermediate activations of the pre-trained network in a detached manner; and 2) a confidence aggregation module that adaptively learns optimal strategies for integrating cross-layer features. We evaluate UniPT with different backbones (e.g., VSE$\infty$, CLIP4Clip, Clip-ViL, and MDETR) on five challenging vision-and-language tasks (i.e., image-text retrieval, video-text retrieval, visual question answering, compositional question answering, and visual grounding). Extensive ablations on ten datasets validate that UniPT not only dramatically reduces memory consumption and outperforms the best memory-efficient competitor, but also achieves higher performance than existing PETL methods in a low-memory scenario across different architectures. Our code is publicly available at: https://github.com/Paranioar/UniPT.
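To make the two-module design more concrete, the following is a minimal forward-pass sketch in plain Python. It is not the released UniPT implementation: the class name `ParallelSideNetwork`, the per-layer down-projection, and the dot-product scoring vector are all hypothetical choices made only for illustration. The sketch assumes the frozen backbone's intermediate activations are handed to the side network as plain constants, so nothing here would require gradients to flow back through the backbone.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length d_in) by weight matrix w (d_in rows, d_out columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_matrix(rng, d_in, d_out):
    """Small random weight matrix (illustrative initialization only)."""
    return [[rng.gauss(0.0, 0.1) for _ in range(d_out)] for _ in range(d_in)]


class ParallelSideNetwork:
    """Hypothetical lightweight network that runs beside a frozen backbone."""

    def __init__(self, num_layers, d_model, d_small, seed=0):
        rng = random.Random(seed)
        # Assumption: one small down-projection per backbone layer.
        # These (and the scoring vector) would be the only trainable weights.
        self.down = [init_matrix(rng, d_model, d_small) for _ in range(num_layers)]
        # Assumption: a single vector that scores each layer's reduced feature.
        self.score = [rng.gauss(0.0, 0.1) for _ in range(d_small)]

    def interact(self, activations):
        # "Parallel interaction" stand-in: every layer's activation is
        # processed independently, not chained through the next layer.
        return [linear(h, w) for h, w in zip(activations, self.down)]

    def aggregate(self, reduced):
        # "Confidence aggregation" stand-in: softmax weights over layers,
        # then a weighted sum of the per-layer features.
        scores = [sum(f * s for f, s in zip(feat, self.score)) for feat in reduced]
        weights = softmax(scores)
        dim = len(reduced[0])
        fused = [sum(wt * feat[j] for wt, feat in zip(weights, reduced)) for j in range(dim)]
        return fused, weights

    def forward(self, activations):
        return self.aggregate(self.interact(activations))


# Usage: four fake "backbone layers", each yielding an 8-dimensional activation.
rng = random.Random(1)
activations = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]
net = ParallelSideNetwork(num_layers=4, d_model=8, d_small=3)
fused, weights = net.forward(activations)
```

In an autograd framework, the same idea would typically mean computing the backbone activations without gradient tracking and training only the side network's weights, which is where the memory saving described above would come from.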