How direct is the link between words and images?

Current word embedding models, despite their success, still suffer from a lack of grounding in the real world. In this line of research, Gunther et al. (2022) proposed a behavioral experiment to investigate the relationship between words and images. In their setup, participants were presented with a target noun and a pair of images, one chosen by their model and the other chosen at random, and were asked to select the image that best matched the target noun. In most cases, participants preferred the image selected by the model. Gunther et al. therefore concluded that a direct link between words and embodied experience is possible. We took their experiment as a point of departure and addressed the following questions. (1) Apart from visually embodied simulation of the given images, what other strategies might subjects have used to solve this task? To what extent does this setup rely on visual information from the images? Can it be solved using purely textual representations? (2) Do current visually grounded embeddings explain subjects' selection behavior better than textual embeddings? (3) Does visual grounding improve the semantic representations of both concrete and abstract words? To address these questions, we designed novel experiments using pre-trained textual and visually grounded word embeddings. Our experiments reveal that subjects' selection behavior can be explained to a large extent by purely text-based embeddings and word-based similarities, suggesting only a minor involvement of active embodied experience. Visually grounded embeddings offered modest advantages over textual embeddings, and only in certain cases. These findings indicate that the experiment by Gunther et al. may not be well suited for tapping into the perceptual experience of participants, and the extent to which it measures visually grounded knowledge is therefore unclear.
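As a rough illustration of how the two-image task could be attempted with purely textual representations, the sketch below predicts a choice by comparing the target word's embedding with an embedding of each image's text label, then measures how often that prediction agrees with a participant's choice. The assumptions are ours, not the paper's: each image is assumed to have a text label with an embedding, cosine similarity is used as the word-based similarity, the names `cosine`, `predict_choice`, and `agreement_rate` are hypothetical, and the vectors are toy values invented for the example rather than data from the study.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def predict_choice(target_vec, label_vec_a, label_vec_b):
    """Pick the image whose label embedding is closer to the target word.

    Assumption: images are represented only by embeddings of their text
    labels, so no visual information enters the prediction.
    """
    sim_a = cosine(target_vec, label_vec_a)
    sim_b = cosine(target_vec, label_vec_b)
    return "a" if sim_a >= sim_b else "b"


def agreement_rate(trials):
    """Fraction of trials where the text-only prediction matches the human choice.

    Each trial is (target_vec, label_vec_a, label_vec_b, human_choice).
    """
    hits = sum(
        predict_choice(target, vec_a, vec_b) == choice
        for target, vec_a, vec_b, choice in trials
    )
    return hits / len(trials)


# Toy trials with invented 3-dimensional vectors (illustration only).
toy_trials = [
    ([1.0, 0.0, 0.2], [0.9, 0.1, 0.1], [0.0, 1.0, 0.0], "a"),
    ([0.0, 1.0, 0.5], [1.0, 0.0, 0.0], [0.1, 0.8, 0.6], "b"),
    ([0.5, 0.5, 0.0], [0.0, 0.0, 1.0], [0.6, 0.4, 0.1], "a"),
]

print(agreement_rate(toy_trials))
```

Under these assumptions, a high agreement rate from label embeddings alone would be one sign that the task can be solved without visual information; the same comparison could be repeated with visually grounded embeddings in place of the textual ones.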
