Pre-training followed by fine-tuning is a prevalent paradigm in computer vision (CV). Recently, parameter-efficient transfer learning (PETL) methods have shown promising performance in transferring knowledge from pre-trained models while training only a few parameters. Despite their success, existing PETL methods in CV can be computationally expensive and require large amounts of memory and time during training, which prevents low-resource users from conducting research and building applications on large models. In this work, we propose Parameter, Memory, and Time Efficient Visual Adapter ($\mathrm{E^3VA}$) tuning to address this issue. We provide a gradient backpropagation highway for low-rank adapters that removes large gradient computations for the frozen pre-trained parameters, resulting in substantial savings in training memory and training time. Furthermore, we optimise the $\mathrm{E^3VA}$ structure for dense prediction tasks to improve model performance. Extensive experiments on the COCO, ADE20K, and Pascal VOC benchmarks show that $\mathrm{E^3VA}$ can save up to 62.2% of training memory and 26.2% of training time on average, while achieving performance comparable to full fine-tuning and better than most PETL methods. Notably, we can even train the Swin-Large-based Cascade Mask RCNN on GTX 1080Ti GPUs with fewer than 1.5% of the parameters trainable.
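To make the gradient-highway idea concrete, the following is a toy sketch in plain Python and not the $\mathrm{E^3VA}$ implementation. It assumes a scalar "backbone" whose frozen layers are single multiplications, and scalar adapters of the form `up * down * input`; the function names and the step counter are hypothetical and introduced only for illustration. It contrasts adapters placed inside the backbone, where the gradient of an early adapter must pass back through later frozen layers, with adapters that write to a separate side path, where it need not.

```python
def inline_adapter_grads(x, frozen_w, down, up):
    """Adapters inside the backbone (toy): z_i = w_i * h_{i-1}, h_i = z_i + up_i * down_i * z_i.

    Returns (output, [d output / d up_i], backward steps taken through frozen layers).
    """
    n = len(frozen_w)
    zs = []
    h = x
    for w, d, u in zip(frozen_w, down, up):
        z = w * h              # frozen layer
        zs.append(z)
        h = z + u * d * z      # adapter output is fed back into the backbone
    out = h

    grads = [0.0] * n
    g = 1.0                    # d out / d h_i, starting at the last layer
    frozen_backward_steps = 0
    for i in reversed(range(n)):
        grads[i] = g * down[i] * zs[i]
        if i > 0:
            # Reaching earlier adapters requires differentiating through
            # this layer's frozen weight, even though it is never updated.
            g = g * (1.0 + up[i] * down[i]) * frozen_w[i]
            frozen_backward_steps += 1
    return out, grads, frozen_backward_steps


def highway_adapter_grads(x, frozen_w, down, up):
    """Adapters on a side path (toy): the backbone runs forward only, and each
    adapter reads a backbone activation and adds its output to a running sum.

    Returns (output, [d output / d up_i], backward steps taken through frozen layers).
    """
    h = x
    side = 0.0
    acts = []
    for w, d, u in zip(frozen_w, down, up):
        h = w * h              # frozen layer, treated as a constant in the backward pass
        acts.append(h)
        side += u * d * h      # adapter output stays on the side path
    out = h + side

    # Each adapter gradient depends only on its own input activation, so no
    # backward step through a frozen layer is needed.
    grads = [d * a for d, a in zip(down, acts)]
    return out, grads, 0
```

Calling both functions with the same inputs, for example `x=1.0`, `frozen_w=[2.0, 3.0]`, `down=[0.5, 0.5]`, `up=[0.0, 0.0]`, shows the inline variant taking one backward step per later frozen layer while the side-path variant takes none. In a real network the frozen layers are large blocks, so skipping their backward pass is where the memory and time savings described above would be expected to come from; the scalar model only illustrates the dependency structure.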