From a language acquisition perspective, neural language models (LMs) are arguably less data-efficient than humans. A fundamental question is why this human-LM gap arises. This study explores one potential advantage of grounded language acquisition: visual information, which humans can usually rely on during language acquisition but which LMs largely lack access to. We ask whether such visual input improves syntactic generalization in LMs. Our experiments, following the poverty-of-the-stimulus paradigm under two scenarios (artificial vs. naturalistic images), demonstrate that access to visual data does help the syntactic generalization of LMs when the alignment between the linguistic and visual components of the input is clear, but not when it is unclear. This highlights the need for additional biases or signals, such as mutual gaze, to enhance cross-modal alignment and enable efficient syntactic generalization in multimodal LMs.