Contrastive Language-Image Pre-training (CLIP) has recently drawn increasing attention for its transferable visual representation learning. However, owing to the semantic gap within datasets, CLIP's pre-trained image-text alignment becomes sub-optimal on downstream tasks, which severely harms its transfer performance. To better adapt the cross-modality embedding space, we propose to enhance CLIP via Visual-guided Texts, named VT-CLIP. Specifically, we guide the textual features of different categories to adaptively explore informative regions of the image and to aggregate visual features through attention mechanisms. In this way, the texts become visual-guided, that is, more semantically correlated with downstream images, which greatly benefits the category-wise matching process. We evaluate VT-CLIP in few-shot settings on 11 well-known classification datasets to demonstrate its effectiveness.
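
The visual-guided text idea can be illustrated with a minimal sketch, assuming a single-head cross-attention in which each category's text feature is the query and the image's patch features serve as keys and values. The abstract does not give these details, so the residual connection, the scaling by the square root of the feature dimension, the temperature of 100, the absence of learned projections, and every name below (`visual_guided_texts`, `classify`, the toy shapes) are assumptions made for illustration, not the authors' implementation. Random vectors stand in for CLIP features, and only the standard library is used so the sketch runs anywhere.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v, eps=1e-8):
    """Scale a vector to (approximately) unit L2 norm."""
    norm = math.sqrt(dot(v, v)) + eps
    return [x / norm for x in v]


def visual_guided_texts(text_feats, patch_feats):
    """Let each category's text feature attend over the image patches.

    text_feats:  K vectors of size D (one per category), used as queries.
    patch_feats: N vectors of size D (one per image region), used as
                 keys and values.
    Returns the K visual-guided text features and the K attention maps.
    """
    dim = len(text_feats[0])
    guided, attn_maps = [], []
    for t in text_feats:
        # Attention weights over image regions for this category.
        weights = softmax([dot(t, p) / math.sqrt(dim) for p in patch_feats])
        # Weighted sum of patch features: the aggregated visual feature.
        aggregated = [
            sum(w * p[j] for w, p in zip(weights, patch_feats))
            for j in range(dim)
        ]
        # Assumption: a residual connection keeps the original text semantics.
        guided.append([a + b for a, b in zip(t, aggregated)])
        attn_maps.append(weights)
    return guided, attn_maps


def classify(image_feat, class_feats, temperature=100.0):
    """Category-wise matching: softmax over scaled cosine similarities."""
    img = normalize(image_feat)
    logits = [temperature * dot(normalize(c), img) for c in class_feats]
    return softmax(logits)


# Toy stand-ins for CLIP features: 3 categories, 49 patches, 16 dimensions.
random.seed(0)
K, N, D = 3, 49, 16
text_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
patch_feats = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
# Assumption: the global image feature is the mean of the patch features.
global_feat = [sum(p[j] for p in patch_feats) / N for j in range(D)]

guided, attn = visual_guided_texts(text_feats, patch_feats)
probs = classify(global_feat, guided)
```

In a trainable version, the queries, keys, and values would presumably pass through learned projections fitted on the few-shot examples; the sketch omits them to keep the data flow visible.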