Despite the impressive performance of autoregressive Language Models (LMs), it has been shown that, due to reporting bias, LMs lack visual knowledge, i.e., they know little about the visual world and its properties. To augment LMs with visual knowledge, existing solutions often rely on explicit images, requiring time-consuming retrieval or image generation systems. This paper shows that explicit images are not necessary to visually augment an LM. Instead, we use visually-grounded text representations obtained from the well-known CLIP multimodal system. For a fair comparison, we modify VALM, a visually-augmented LM which uses image retrieval and representation, to work directly with visually-grounded text representations. We name this new model BLIND-VALM. We show that BLIND-VALM performs on par with VALM on Visual Language Understanding (VLU), Natural Language Understanding (NLU) and Language Modeling tasks, while being significantly simpler and more efficient. We also show that when scaling up our model within the compute budget of VALM, either by increasing the model size or the pre-training corpus size, we outperform VALM on all evaluation tasks.
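To make the core idea concrete, the sketch below shows how visually-grounded text representations can be obtained from CLIP's text encoder alone, with no image retrieval or generation in the loop. It assumes the Hugging Face `transformers` CLIP implementation and an illustrative checkpoint name; the exact CLIP variant and how BLIND-VALM fuses these representations into the LM are not specified here.

```python
# Minimal sketch: visually-grounded text representations from CLIP's text
# encoder (no images involved). Checkpoint choice is an assumption.
import torch
from transformers import CLIPTokenizer, CLIPTextModel

model_name = "openai/clip-vit-base-patch32"  # illustrative CLIP variant
tokenizer = CLIPTokenizer.from_pretrained(model_name)
text_encoder = CLIPTextModel.from_pretrained(model_name)

tokens = tokenizer(["a red apple on a wooden table"], return_tensors="pt")
with torch.no_grad():
    outputs = text_encoder(**tokens)

# Per-token hidden states from a text encoder trained with image-text
# contrastive supervision, hence "visually grounded".
grounded_tokens = outputs.last_hidden_state  # (batch, seq_len, hidden_dim)

# Pooled (EOS-token) state; projecting it (e.g. via
# CLIPTextModelWithProjection) maps it into CLIP's shared image-text space.
pooled = outputs.pooler_output               # (batch, hidden_dim)
```

Because these representations come from a frozen text encoder, producing them costs a single forward pass per context, in contrast to the retrieval index lookups or image generation that explicit-image augmentation requires.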