Intermediate features of a pre-trained model have been shown to be informative for making accurate predictions on downstream tasks, even when the model backbone is kept frozen. The key challenge is how to utilize these intermediate features given their sheer volume. We propose visual query tuning (VQT), a simple yet effective approach to aggregating the intermediate features of Vision Transformers. By introducing a handful of learnable ``query'' tokens to each layer, VQT leverages the inner workings of Transformers to ``summarize'' the rich intermediate features of each layer, which can then be used to train the prediction heads of downstream tasks. Because VQT keeps the intermediate features intact and only learns to combine them, it is memory-efficient in training compared to many other parameter-efficient fine-tuning approaches, which learn to adapt features and need back-propagation through the entire backbone. This also suggests that VQT and those approaches play complementary roles in transfer learning. Empirically, VQT consistently surpasses the state-of-the-art approach that utilizes intermediate features for transfer learning and outperforms full fine-tuning in many cases. Compared to parameter-efficient approaches that adapt features, VQT achieves much higher accuracy under memory constraints. Most importantly, VQT is compatible with these approaches and can be combined with them to attain even higher accuracy, making it a simple add-on to further boost transfer learning.
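The aggregation idea described above can be illustrated with a minimal sketch. It is only one plausible reading of the abstract, not the authors' implementation: it assumes single-head scaled dot-product attention with no learned projections, uses random vectors as stand-ins for the frozen intermediate features and for the learnable query tokens, and omits training of the prediction head. The names `summarize_layer` and `vqt_features` and all dimensions are hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def summarize_layer(queries, features):
    """Let each query attend over the token features of one layer.

    queries:  T query vectors of size D (learnable in a real setup)
    features: N token vectors of size D (frozen backbone output, only read)
    returns:  T summary vectors of size D
    """
    dim = len(features[0])
    scale = 1.0 / math.sqrt(dim)
    summaries = []
    for q in queries:
        # Attention weights of this query over all tokens of the layer.
        weights = softmax([dot(q, f) * scale for f in features])
        # Weighted average of the token features.
        summary = [
            sum(w * f[d] for w, f in zip(weights, features))
            for d in range(dim)
        ]
        summaries.append(summary)
    return summaries


def vqt_features(layer_queries, layer_features):
    """Concatenate the query summaries of all layers into one flat vector."""
    flat = []
    for queries, features in zip(layer_queries, layer_features):
        for summary in summarize_layer(queries, features):
            flat.extend(summary)
    return flat


# Assumed toy sizes, chosen only for illustration.
NUM_LAYERS, NUM_TOKENS, NUM_QUERIES, DIM = 3, 5, 2, 4
rng = random.Random(0)

# Random stand-ins for the frozen per-layer token features.
layer_features = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_TOKENS)]
    for _ in range(NUM_LAYERS)
]
# Random stand-ins for the per-layer query tokens.
layer_queries = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_QUERIES)]
    for _ in range(NUM_LAYERS)
]

head_input = vqt_features(layer_queries, layer_features)
print(len(head_input))  # NUM_LAYERS * NUM_QUERIES * DIM = 24
```

In this sketch the token features are only read, never overwritten, which mirrors the claim that the intermediate features stay intact. The resulting vector has a size that depends on the number of queries rather than the number of tokens, so a small prediction head could be trained on it.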