Large-scale pre-trained models have achieved remarkable success in various computer vision tasks. A standard approach to leveraging these models is to fine-tune all model parameters for each downstream task, which is costly in both computation and storage. Recently, inspired by Natural Language Processing (NLP), parameter-efficient transfer learning has been successfully applied to vision tasks. However, most existing techniques focus on single-task adaptation, and the few methods that address multi-task adaptation often exhibit suboptimal training and inference efficiency. In this paper, we first propose a once-for-all Vision Multi-Task Adapter (VMT-Adapter), which achieves approximately O(1) training and inference efficiency with respect to the number of tasks. Concretely, VMT-Adapter shares knowledge across multiple tasks to enhance cross-task interaction while preserving task-specific knowledge via independent knowledge extraction modules. Notably, since the task-specific modules require few parameters, VMT-Adapter can handle an arbitrary number of tasks with a negligible increase in trainable parameters. We also propose VMT-Adapter-Lite, which further reduces the number of trainable parameters by learning shared parameters between the down- and up-projections. Extensive experiments on four dense scene understanding tasks demonstrate the superiority of VMT-Adapter(-Lite), which achieves a 3.96% (1.34%) relative improvement over single-task full fine-tuning while training only ~1% (0.36%) of the pre-trained model's parameters.
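To make the shared-versus-task-specific split concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the shared knowledge lives in one down-projection and one up-projection, and that each "knowledge extraction module" is a per-task scale-and-shift vector applied in the bottleneck. The class name `SharedAdapterSketch`, the ReLU placement, the initialization, and all dimensions are hypothetical choices made only for illustration.

```python
import numpy as np


class SharedAdapterSketch:
    """Illustrative adapter: shared projections plus tiny per-task modules.

    Assumption (not stated in the abstract): each task-specific module is a
    scale vector and a shift vector over the bottleneck features.
    """

    def __init__(self, dim, bottleneck, num_tasks, seed=0):
        rng = np.random.default_rng(seed)
        # Shared across all tasks: down- and up-projection.
        self.down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.up = np.zeros((bottleneck, dim))  # zero init: adapter starts as identity
        # Task-specific: 2 * bottleneck parameters per task.
        self.scale = np.ones((num_tasks, bottleneck))
        self.shift = np.zeros((num_tasks, bottleneck))

    def forward(self, x):
        """x: (tokens, dim) -> (num_tasks, tokens, dim)."""
        # The shared down-projection is computed once for all tasks.
        h = np.maximum(x @ self.down, 0.0)
        # Per-task work here is only an elementwise scale and shift.
        h_tasks = self.scale[:, None, :] * h[None, :, :] + self.shift[:, None, :]
        # The shared up-projection is applied to every task branch in one batched matmul,
        # followed by a residual connection to the input.
        return x[None, :, :] + h_tasks @ self.up

    def num_shared_params(self):
        return self.down.size + self.up.size

    def num_task_params(self):
        return self.scale.size + self.shift.size


adapter = SharedAdapterSketch(dim=768, bottleneck=64, num_tasks=4)
features = np.ones((196, 768))
outputs = adapter.forward(features)
```

Under these assumptions, adding one more task adds only `2 * bottleneck` trainable values, while the shared projections stay fixed in size. This mirrors the claim that the parameter count grows negligibly with the number of tasks, though the actual module design should be taken from the paper itself.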