Parameter-efficient transfer learning (PETL) has emerged as a flourishing research field for adapting large pre-trained models to downstream tasks, greatly reducing trainable parameters yet still facing heavy memory costs during fine-tuning. To address this, memory-efficient transfer learning (METL) methods avoid backpropagating gradients through the large backbone. However, they compromise by relying exclusively on frozen intermediate outputs, limiting the exploration of prior knowledge from pre-trained models. Moreover, the dependency and redundancy among cross-layer features are frequently overlooked, submerging more discriminative representations and causing an inherent performance gap versus conventional PETL methods. Hence, we propose an innovative METL strategy called SHERL for resource-limited scenarios, which decouples the entire adaptation into two successive and complementary processes. In the early route, intermediate outputs are consolidated via an anti-redundancy operation, enhancing their compatibility for subsequent interactions; then in the late route, utilizing a minimal number of late pre-trained layers alleviates the peak memory overhead and regulates these flexible features into more adaptive and powerful representations for new domains. Extensive experiments on vision-and-language and language-only tasks show that SHERL combines the strengths of both parameter- and memory-efficient techniques, performing on par with or better than prior methods across diverse architectures with lower memory during fine-tuning. Our code is publicly available at: https://github.com/Paranioar/SHERL.
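The two-route idea can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the paper's implementation: the backbone layers, the similarity-based anti-redundancy merge, and all names here are illustrative assumptions. It shows the structural point that early frozen layers only need a forward pass (no stored gradients), their intermediates are merged with redundant contributions down-weighted, and only the final pre-trained layer participates in adaptation.

```python
import numpy as np

rng = np.random.default_rng(0)

def frozen_layer(x, w):
    # One frozen backbone layer: forward pass only, no gradient state kept.
    return np.tanh(x @ w)

d, n_layers = 16, 4
# Hypothetical pre-trained weights for a stack of backbone layers.
weights = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_layers)]

# Early route: run all but the last layer frozen, collecting intermediates.
x = rng.standard_normal((1, d))
h, intermediates = x, []
for w in weights[:-1]:
    h = frozen_layer(h, w)
    intermediates.append(h)

# Illustrative anti-redundancy merge: features highly similar to the
# running summary (high |cosine|) contribute less, so near-duplicate
# layer outputs do not drown out distinctive ones.
summary = intermediates[0]
for feat in intermediates[1:]:
    s, f = summary.ravel(), feat.ravel()
    cos = (s @ f) / (np.linalg.norm(s) * np.linalg.norm(f) + 1e-8)
    summary = summary + (1.0 - abs(cos)) * feat

# Late route: only the last pre-trained layer (plus any adapter one might
# attach here) would be active during backprop, keeping peak memory low.
out = frozen_layer(summary, weights[-1])
```

In a real METL setup the early route would run under a no-gradient context so no activations are retained for backpropagation; only the short late route stores what gradient computation needs.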