Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways -- requiring orders of magnitude more language data than children receive during development, and without any of the accompanying grounding in perception, action, or social behavior. Do models trained more naturalistically -- with grounded supervision -- exhibit more human-like language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary supervision from image captioning tasks, on datasets of varying scales. We then evaluate these models on a broad set of benchmarks characterizing models' learning of syntactic categories, lexical relations, semantic features, semantic similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and are sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant -- we find that models driven mainly by visual information yield qualitatively different results from those driven mainly by word co-occurrences. However, our results suggest that current multi-modal modeling approaches fail to effectively leverage visual information to build more human-like word representations from human-sized datasets.
