Pre-trained vision transformers provide strong representations that benefit various downstream tasks. Recently, many parameter-efficient fine-tuning (PEFT) methods have been proposed, and their experiments demonstrate that tuning only 1% extra parameters can surpass full fine-tuning in low-data scenarios. However, these methods overlook task-specific information when fine-tuning on diverse downstream tasks. In this paper, we propose a simple yet effective method called "Salient Channel Tuning" (SCT), which leverages task-specific information by forwarding the task images through the model to select a subset of channels in a feature map. This lets us tune only 1/8 of the channels, leading to significantly lower parameter costs. SCT outperforms full fine-tuning on 18 out of 19 tasks in the VTAB-1K benchmark while adding only 0.11M parameters to ViT-B, which is 780$\times$ fewer than its full fine-tuning counterpart. Furthermore, in experiments on domain generalization and few-shot learning, SCT surpasses other PEFT methods at lower parameter cost, demonstrating the strong capability and effectiveness of the proposed tuning technique in the low-data regime.
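
The channel-selection idea can be illustrated with a minimal NumPy sketch. The abstract does not state how channel saliency is scored or what form the tuned module takes, so everything specific here is an assumption: the score (mean absolute activation over the forwarded task images), the zero-initialized residual linear map on the selected channels, and the hypothetical names `select_salient_channels` and `tune_salient_channels`. The feature shape mimics ViT-B (197 tokens, 768 channels), and the data is synthetic.

```python
import numpy as np


def select_salient_channels(features, keep_ratio=1 / 8):
    """Pick the top-K channels of a feature map by an ASSUMED importance score.

    features: array of shape (..., channels), activations collected by
    forwarding task images through the frozen model.
    Returns the sorted indices of the K = channels * keep_ratio channels
    with the largest mean absolute activation.
    """
    num_channels = features.shape[-1]
    k = max(1, int(num_channels * keep_ratio))
    scores = np.abs(features).reshape(-1, num_channels).mean(axis=0)
    top_k = np.argsort(scores)[-k:]
    return np.sort(top_k)


def tune_salient_channels(features, idx, weight, bias):
    """Apply a small trainable linear map to the selected channels only.

    ASSUMED form: a residual update on the K selected channels; all other
    channels pass through unchanged.
    """
    out = features.copy()
    selected = features[..., idx]  # shape (..., K)
    out[..., idx] = selected + selected @ weight + bias
    return out


# Synthetic stand-in for activations of 4 task images (not real data).
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 197, 768))
feats[..., :96] *= 5.0  # make the first 96 channels artificially salient

idx = select_salient_channels(feats)  # 768 / 8 = 96 channel indices
k = idx.size

# Zero-initialized module: the tuned output initially equals the input.
weight = np.zeros((k, k))
bias = np.zeros(k)
tuned = tune_salient_channels(feats, idx, weight, bias)

# Trainable parameters of one such module: K*K weights plus K biases.
params_per_module = k * k + k
print(k, params_per_module)
```

Under these assumptions one module has 96 × 96 + 96 = 9,312 trainable parameters. If one such module were attached to each of the 12 blocks of ViT-B (again an assumption, not something the abstract states), the total would be 111,744, which is in line with the 0.11M figure quoted above.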