We introduce the first multitasking vision transformer adapters that learn generalizable task affinities which can be applied to novel tasks and domains. Integrated into an off-the-shelf vision transformer backbone, our adapters can simultaneously solve multiple dense vision tasks in a parameter-efficient manner, unlike existing multitasking transformers that are parametrically expensive. In contrast to concurrent methods, we do not require retraining or fine-tuning whenever a new task or domain is added. We introduce a task-adapted attention mechanism within our adapter framework that combines gradient-based task similarities with attention-based ones. The learned task affinities generalize to the following settings: zero-shot task transfer, unsupervised domain adaptation, and generalization without fine-tuning to novel domains. We demonstrate that our approach outperforms not only the existing convolutional neural network-based multitasking methods but also the vision transformer-based ones. Our project page is at \url{https://ivrl.github.io/VTAGML}.