This work investigates a simple yet powerful dense prediction task adapter for the Vision Transformer (ViT). Unlike recently developed variants that incorporate vision-specific inductive biases into their architectures, the plain ViT suffers from inferior performance on dense prediction tasks due to weak prior assumptions. To address this issue, we propose the ViT-Adapter, which allows a plain ViT to achieve performance comparable to vision-specific transformers. Specifically, the backbone in our framework is a plain ViT that can learn powerful representations from large-scale multi-modal data. When transferring to downstream tasks, a pre-training-free adapter is used to introduce image-related inductive biases into the model, making it suitable for these tasks. We verify ViT-Adapter on multiple dense prediction tasks, including object detection, instance segmentation, and semantic segmentation. Notably, without using extra detection data, our ViT-Adapter-L yields state-of-the-art 60.9 box AP and 53.0 mask AP on COCO test-dev. We hope that the ViT-Adapter can serve as an alternative to vision-specific transformers and facilitate future research. The code and models will be released at https://github.com/czczup/ViT-Adapter.