Hierarchical Side-Tuning for Vision Transformers

Fine-tuning pre-trained Vision Transformers (ViTs) has consistently delivered strong performance in visual recognition. However, adapting large pre-trained models to many tasks is challenging: each task requires its own independent, full fine-tuning run, which incurs substantial computational and memory costs. Recent Parameter-Efficient Transfer Learning (PETL) methods can outperform full fine-tuning while updating only a small subset of parameters, but they tend to overlook dense prediction tasks such as object detection and segmentation. In this paper, we introduce Hierarchical Side-Tuning (HST), a novel PETL approach that enables effective ViT transfer to a variety of downstream tasks. Unlike existing methods that fine-tune only parameters in the input space or in specific modules attached to the backbone, we tune a lightweight hierarchical side network (HSN) that takes intermediate activations extracted from the backbone and generates multi-scale features for prediction. To validate HST, we conduct extensive experiments across diverse visual tasks, including classification, object detection, instance segmentation, and semantic segmentation. Notably, our method achieves a state-of-the-art average Top-1 accuracy of 76.0% on VTAB-1k while fine-tuning only 0.78M parameters. On the COCO test-dev benchmark, HST even surpasses full fine-tuning, obtaining 49.7 box AP and 43.2 mask AP with Cascade Mask R-CNN.
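
The data flow described above (a frozen backbone exposing intermediate activations, and a small separate network turning them into features at several resolutions) can be illustrated with a toy sketch. This is not the HSN architecture itself: the class names, the scalar "weights", the 1-D token list, and the pooling factors of 1, 2, and 4 are all assumptions made for illustration, chosen so the example runs with the Python standard library alone.

```python
def avg_pool(seq, factor):
    """Average non-overlapping windows of `factor` values (1-D stand-in for spatial pooling)."""
    return [sum(seq[i:i + factor]) / factor for i in range(0, len(seq), factor)]


class FrozenBackbone:
    """Toy stand-in for a pre-trained ViT; its weights are never updated."""

    def __init__(self, weights):
        self.weights = list(weights)  # one scalar per "block" (hypothetical)

    def forward(self, tokens):
        activations = []
        for w in self.weights:
            tokens = [w * t for t in tokens]
            activations.append(tokens)  # keep every intermediate activation
        return activations


class SideNetwork:
    """Toy hierarchical side network; the only part that would be trained."""

    def __init__(self, num_stages):
        self.scales = [1.0] * num_stages  # trainable scalars, one per stage

    def forward(self, activations):
        features = []
        for stage, (scale, act) in enumerate(zip(self.scales, activations)):
            # Assumption: deeper stages produce coarser features (pool by 1, 2, 4, ...).
            features.append([scale * v for v in avg_pool(act, 2 ** stage)])
        return features

    def num_trainable(self):
        return len(self.scales)


backbone = FrozenBackbone([1.0, 2.0, 0.5])
side = SideNetwork(num_stages=3)

tokens = [1.0, 2.0, 3.0, 4.0]
features = side.forward(backbone.forward(tokens))

for stage, feat in enumerate(features):
    print(f"stage {stage}: length {len(feat)}")
print("trainable values:", side.num_trainable())
```

The point of the sketch is the separation of roles: the backbone is only read from, while every tunable value lives in the side network, which is why the number of updated parameters can stay small relative to the backbone.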
