Pre-training visual and textual representations from large-scale image-text pairs is becoming a standard approach for many downstream vision-language tasks. Transformer-based models learn inter- and intra-modal attention through a set of self-supervised learning tasks. This paper proposes LAViTeR, a novel architecture for visual and textual representation learning. The main module, Visual Textual Alignment (VTA), is assisted by two auxiliary tasks: GAN-based image synthesis and Image Captioning. We also propose a new evaluation metric that measures the similarity between the learned visual and textual embeddings. Experimental results on two public datasets, CUB and MS-COCO, demonstrate superior visual and textual representation alignment in the joint feature embedding space.