Neural language models (LMs) are arguably less data-efficient than humans from a language acquisition perspective. A fundamental question is why this human-LM gap arises. This study explores the advantage of grounded language acquisition, specifically the impact of visual information -- which humans can usually rely on during language acquisition but LMs largely cannot -- on syntactic generalization in LMs. Our experiments, following the poverty-of-stimulus paradigm under two scenarios (artificial vs. naturalistic images), demonstrate that when the alignment between the linguistic and visual components of the input is clear, access to visual data does help with the syntactic generalization of LMs; when it is not, visual input does not help. This highlights the need for additional biases or signals, such as mutual gaze, to enhance cross-modal alignment and enable efficient syntactic generalization in multimodal LMs.