Hierarchical Side-Tuning for Vision Transformers

Fine-tuning pre-trained Vision Transformers (ViTs) has consistently delivered strong performance in visual recognition. However, adapting large pre-trained models to many tasks is challenging: each model must undergo an independent and comprehensive fine-tuning process, which leads to substantial computational and memory demands. Recent Parameter-efficient Transfer Learning (PETL) methods can outperform full fine-tuning while updating only a small subset of parameters, but they tend to overlook dense prediction tasks such as object detection and segmentation. In this paper, we introduce Hierarchical Side-Tuning (HST), a novel PETL approach that enables effective ViT transfer to a variety of downstream tasks. Unlike existing methods that exclusively fine-tune parameters within the input space or in certain modules connected to the backbone, we tune a lightweight hierarchical side network (HSN) that takes intermediate activations extracted from the backbone and generates multi-scale features to make predictions. To validate HST, we conduct extensive experiments on diverse visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. Notably, our method achieves a state-of-the-art average Top-1 accuracy of 76.0% on VTAB-1k while fine-tuning only 0.78M parameters. On the COCO test-dev benchmark, HST even surpasses full fine-tuning, obtaining 49.7 box AP and 43.2 mask AP with Cascade Mask R-CNN.
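
The side-tuning idea described above can be illustrated with a small, dependency-free sketch. It is not the HSN architecture itself, whose internals the abstract does not specify. It assumes single-channel feature maps stored as nested lists, scalar "weights", and a simple resize-and-gate rule for turning single-scale backbone activations into a feature pyramid. All names (`FrozenBackbone`, `SideNetwork`, `gates`, `resize`) are hypothetical and chosen only for illustration.

```python
# Toy illustration of side-tuning: the backbone is frozen, and only the
# side network's parameters would be trained. All names are hypothetical.

def avg_pool(grid, k):
    """Downsample a square grid by averaging non-overlapping k x k windows."""
    n = len(grid) // k
    return [[sum(grid[i * k + a][j * k + b] for a in range(k) for b in range(k)) / (k * k)
             for j in range(n)] for i in range(n)]

def upsample(grid, k):
    """Nearest-neighbour upsampling of a square grid by an integer factor k."""
    n = len(grid) * k
    return [[grid[i // k][j // k] for j in range(n)] for i in range(n)]

def resize(grid, size):
    """Resize a square grid to size x size (sizes must divide evenly)."""
    n = len(grid)
    if size == n:
        return [row[:] for row in grid]
    return upsample(grid, size // n) if size > n else avg_pool(grid, n // size)

class FrozenBackbone:
    """Stand-in for a pre-trained ViT: fixed weights, single-scale activations."""
    def __init__(self, weights):
        self.weights = tuple(weights)  # immutable tuple: never updated

    def forward(self, x):
        activations = []
        for w in self.weights:
            x = [[w * v + 1.0 for v in row] for row in x]
            activations.append(x)  # keep every intermediate activation
        return activations

class SideNetwork:
    """Lightweight side branch: one trainable gate per output scale."""
    def __init__(self, sizes):
        self.sizes = list(sizes)
        self.gates = [0.5] * len(self.sizes)  # the only trainable parameters

    def forward(self, activations):
        # Each stage reads one backbone activation and emits one pyramid level.
        return [[[g * v for v in row] for row in resize(act, size)]
                for g, size, act in zip(self.gates, self.sizes, activations)]

backbone = FrozenBackbone([1.0, 1.0, 1.0, 1.0])
side = SideNetwork([16, 8, 4, 2])
image = [[1.0] * 8 for _ in range(8)]
pyramid = side.forward(backbone.forward(image))
print([len(level) for level in pyramid])  # [16, 8, 4, 2]
```

In this toy setup the backbone output stays at one resolution (8 x 8), while the side branch produces four levels at different resolutions, which is the kind of multi-scale output that dense prediction heads typically consume. Only `side.gates` would receive gradient updates in a real training loop.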
