Vision-language (VL) pre-training, covering both image-text and video-text models, is a recently popular paradigm that has achieved state-of-the-art results on multi-modal tasks such as image retrieval, video retrieval, and visual question answering. These models are trained in an unsupervised way and benefit greatly from the supervision provided by the complementary modality. In this paper, we explore whether language representations trained with vision supervision perform better than vanilla language representations on Natural Language Understanding and commonsense reasoning benchmarks. We experiment with a diverse set of image-text models, such as ALBEF, BLIP, and METER, and video-text models, such as ALPRO, Frozen-in-Time (FiT), and VIOLET. We compare the language representations of the stand-alone text encoders of these models with the language representations of the text encoders learnt through vision supervision. Our experiments suggest that vanilla language representations show superior performance on most of the tasks. These results shed light on the current drawbacks of vision-language models.