Language with Vision: a Study on Grounded Word and Sentence Embeddings

Grounding language in vision is an active field of research that seeks to construct cognitively plausible word and sentence representations by incorporating perceptual knowledge from vision into text-based representations. Despite many attempts at language grounding, striking the right balance between textual representations of language and our embodied experiences remains an open problem. Several questions recur. Is visual grounding advantageous for abstract words, or is its effectiveness restricted to concrete words? What is the optimal way of bridging the gap between text and vision? To what extent is perceptual knowledge from images advantageous for acquiring high-quality embeddings? Leveraging recent advances in machine learning and natural language processing, the present study addresses these questions by proposing a simple yet effective computational grounding model for pre-trained word embeddings. The model balances the interplay between language and vision by aligning textual embeddings with visual information while preserving the distributional statistics that characterize word usage in text corpora. By applying the learned alignment, we can indirectly ground unseen words, including abstract words. A series of evaluations on a range of behavioural datasets shows that visual grounding is beneficial not only for concrete words but also for abstract words, lending support to the indirect theory of abstract concepts. Moreover, our approach offers advantages for contextualized embeddings, such as those generated by BERT, but only when they are trained on corpora of modest, cognitively plausible sizes. Code and grounded embeddings for English are available at https://github.com/Hazel1994/Visually_Grounded_Word_Embeddings_2.
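
The idea of learning an alignment on words that have visual information and then applying it to unseen words can be illustrated with a minimal sketch. This is not the authors' model: it assumes a plain ridge-regularized linear map, synthetic random data in place of real text and image features, and concatenation as a stand-in for preserving the distributional information. The names `fit_alignment`, `ground`, and all dimensions are hypothetical.

```python
import numpy as np


def fit_alignment(text_vecs, visual_vecs, ridge=1e-3):
    """Fit W minimising ||text_vecs @ W - visual_vecs||^2 + ridge * ||W||^2.

    Assumption: a linear map with ridge regularisation; the paper's actual
    alignment may differ.
    """
    dim = text_vecs.shape[1]
    gram = text_vecs.T @ text_vecs + ridge * np.eye(dim)
    return np.linalg.solve(gram, text_vecs.T @ visual_vecs)


def ground(text_vecs, W):
    """Keep the original text vector and append its visual projection.

    Assumption: concatenation is used here only to illustrate retaining
    the text-based (distributional) part alongside the aligned part.
    """
    return np.concatenate([text_vecs, text_vecs @ W], axis=1)


# Synthetic stand-ins for real embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
seen_text = rng.normal(size=(200, 16))       # words paired with images
true_map = rng.normal(size=(16, 8))          # unknown text-to-vision relation
seen_visual = seen_text @ true_map + 0.01 * rng.normal(size=(200, 8))

# Learn the alignment on the seen words only.
W = fit_alignment(seen_text, seen_visual)

# Apply it to words that never appeared with an image, e.g. abstract words.
unseen_text = rng.normal(size=(5, 16))
grounded_unseen = ground(unseen_text, W)
print(grounded_unseen.shape)
```

Because the map is a function of the textual embedding alone, any word with a text vector can be projected, which is what makes indirect grounding of unseen words possible in this simplified setting.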