Vision-Language (VL) models with the Two-Tower architecture have dominated vision-language representation learning in recent years. Current VL models either use lightweight uni-modal encoders and learn to extract, align and fuse both modalities simultaneously in a deep cross-modal encoder, or feed the last-layer uni-modal representations from deep pre-trained uni-modal encoders into the top cross-modal encoder. Both approaches potentially restrict vision-language representation learning and limit model performance. In this paper, we propose Bridge-Tower, which introduces multiple bridge layers that connect the top layers of the uni-modal encoders with each layer of the cross-modal encoder. This enables effective bottom-up cross-modal alignment and fusion, within the cross-modal encoder, of visual and textual representations drawn from different semantic levels of the pre-trained uni-modal encoders. Pre-trained with only 4M images, Bridge-Tower achieves state-of-the-art performance on various downstream vision-language tasks. In particular, on the VQAv2 test-std set, Bridge-Tower achieves an accuracy of 78.73%, outperforming the previous state-of-the-art model METER by 1.09% with the same pre-training data and almost negligible additional parameters and computational costs. Notably, when further scaling the model, Bridge-Tower achieves an accuracy of 81.15%, surpassing models that are pre-trained on datasets that are orders of magnitude larger. Code and checkpoints are available at \url{https://github.com/microsoft/BridgeTower}.
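The bridge-layer wiring described above can be illustrated with a minimal, dependency-free sketch. It is not the released implementation: the abstract does not specify the form of a bridge layer, so the add-then-normalize `bridge` function is an assumption, `toy_cross_layer` is a hypothetical stand-in for a real cross-modal Transformer layer, and each modality is reduced to a single feature vector per layer for brevity.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def layer_norm(x: Vector, eps: float = 1e-5) -> Vector:
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def bridge(cross_state: Vector, uni_state: Vector) -> Vector:
    """Assumed bridge form: add the previous cross-modal output to one
    uni-modal layer's output, then normalize."""
    return layer_norm([c + u for c, u in zip(cross_state, uni_state)])


def toy_cross_layer(own: Vector, other: Vector) -> Vector:
    """Hypothetical placeholder for a cross-modal layer: mixes each feature
    with the mean of the other modality's features."""
    other_mean = sum(other) / len(other)
    return [0.5 * v + 0.5 * other_mean for v in own]


def bridged_cross_encoder(
    visual_tops: List[Vector],
    text_tops: List[Vector],
    cross_layer: Callable[[Vector, Vector], Vector] = toy_cross_layer,
) -> Tuple[Vector, Vector]:
    """Run K cross-modal layers, where layer k is fed through a bridge from
    the k-th of the top-K uni-modal layers (ordered bottom to top)."""
    if not visual_tops or len(visual_tops) != len(text_tops):
        raise ValueError("need the same non-zero number of layers per modality")

    # Assumption: the cross-modal state starts at zero, so the first bridge
    # simply normalizes the first uni-modal representation.
    v = [0.0] * len(visual_tops[0])
    t = [0.0] * len(text_tops[0])

    for v_uni, t_uni in zip(visual_tops, text_tops):
        v_in = bridge(v, v_uni)  # visual bridge layer
        t_in = bridge(t, t_uni)  # textual bridge layer
        v, t = cross_layer(v_in, t_in), cross_layer(t_in, v_in)
    return v, t


if __name__ == "__main__":
    # Two top uni-modal layers per modality, four features each (toy values).
    visual = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.1, -0.3, 2.0]]
    text = [[0.0, 1.0, 0.0, -1.0], [2.0, 2.5, 1.0, 0.0]]
    fused_v, fused_t = bridged_cross_encoder(visual, text)
    print(fused_v, fused_t)
```

The contrast with the second design criticized in the abstract is that here every cross-modal layer receives its own uni-modal input, rather than only the last-layer representation entering at the bottom of the cross-modal encoder.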