Pre-trained vision transformers provide strong representations that benefit various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1\% extra parameters can surpass full fine-tuning in low-data scenarios. However, these methods overlook task-specific information when fine-tuning on diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT) that leverages task-specific information: we run a forward pass of the model on the task images to select a subset of channels in a feature map, which allows us to tune only 1/8 of the channels and leads to significantly lower parameter costs. Experiments on 19 visual transfer learning downstream tasks demonstrate that SCT outperforms full fine-tuning on 18 out of 19 tasks while adding only 0.11M parameters to ViT-B, which is 780$\times$ fewer than its full fine-tuning counterpart. Experiments on domain generalization and few-shot classification further demonstrate the effectiveness and generality of our approach. The code is available at https://github.com/showlab/SCT.
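The selection step described above can be illustrated with a minimal sketch. The abstract does not state the scoring rule or the form of the tuned module, so the following choices are assumptions made purely for illustration: channels are ranked by the sum of squared activations over the task's tokens, and the selected channels pass through a small linear map with a residual connection. All function names (`channel_scores`, `select_salient_channels`, `tune_selected`, `extra_params`) are hypothetical and are not taken from the released code.

```python
import random


def channel_scores(features):
    """Per-channel sum of squared activations over all tokens.

    Assumption: the abstract does not specify the saliency criterion;
    a squared L2 norm is used here only as a plausible stand-in.
    """
    scores = [0.0] * len(features[0])
    for token in features:
        for c, value in enumerate(token):
            scores[c] += value * value
    return scores


def select_salient_channels(features, fraction=1 / 8):
    """Return the sorted indices of the top `fraction` of channels."""
    scores = channel_scores(features)
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
    return sorted(ranked[:k])


def tune_selected(token, indices, weight, bias):
    """Update only the selected channels: x[i] + (W @ x_selected + b)[row].

    Assumption: a k-by-k linear map with a residual connection; all other
    channels pass through unchanged.
    """
    selected = [token[i] for i in indices]
    out = list(token)
    for row, i in enumerate(indices):
        delta = sum(w * x for w, x in zip(weight[row], selected)) + bias[row]
        out[i] = token[i] + delta
    return out


def extra_params(dim, fraction, num_layers):
    """Trainable parameters if one k-by-k linear map (plus bias) is added per layer."""
    k = int(dim * fraction)
    return num_layers * (k * k + k)


# Toy features: 64 tokens, 16 channels; channels 3 and 11 are made dominant.
random.seed(0)
features = [
    [random.gauss(0.0, 10.0 if c in (3, 11) else 1.0) for c in range(16)]
    for _ in range(64)
]
salient = select_salient_channels(features)  # 16 channels * 1/8 = 2 selected
print(salient)

# ViT-B has width 768 and 12 blocks; 1/8 of the channels gives k = 96.
print(extra_params(768, 1 / 8, 12))  # 111744
```

Under these assumptions, a ViT-B with width 768 and 12 blocks gains 12 × (96 × 96 + 96) = 111,744 trainable parameters, which is consistent with the 0.11M figure quoted above. Likewise, 0.11M × 780 ≈ 85.8M matches the commonly cited size of ViT-B. This agreement suggests, but does not confirm, that the actual module has a similar shape.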