Today's most accurate language models are trained on orders of magnitude more language data than human language learners receive, but with no supervision from the other sensory modalities that play a crucial role in human learning. Can we make LMs' representations and predictions more accurate (and more human-like) with more ecologically plausible supervision? This paper describes LexiContrastive Grounding (LCG), a grounded language learning procedure that leverages visual supervision to improve textual representations. LexiContrastive Grounding combines a next-token prediction strategy with a contrastive visual grounding objective, focusing on early-layer representations that encode lexical information. Across multiple word-learning and sentence-understanding benchmarks, LexiContrastive Grounding not only outperforms standard language-only models in learning efficiency, but also improves upon vision-and-language learning procedures including CLIP, GIT, Flamingo, and Vokenization. Moreover, LexiContrastive Grounding reduces perplexity by around 5% on multiple language modeling tasks. This work underscores the potential of incorporating visual grounding into language models, aligning them more closely with the multimodal nature of human language acquisition.
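As a rough illustration of what combining a next-token prediction loss with a contrastive visual grounding objective could look like, the following is a minimal sketch in plain Python. It is not the paper's implementation: the abstract does not give the exact loss, so the InfoNCE-style form, the `temperature` and `weight` parameters, and all function names here are assumptions made for illustration. The text vectors stand in for early-layer (lexical) representations, and the image vectors for the output of some visual encoder.

```python
import math

def log_softmax(scores):
    """Numerically stable log-softmax over a list of floats."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def next_token_loss(logits, target):
    """Cross-entropy at one position: -log p(target | context)."""
    return -log_softmax(logits)[target]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def contrastive_loss(text_vecs, image_vecs, temperature=0.1):
    """InfoNCE-style loss (an assumed form, not taken from the paper).

    text_vecs[i] is treated as the positive match for image_vecs[i];
    every other image in the batch acts as a negative.
    """
    texts = [normalize(v) for v in text_vecs]
    images = [normalize(v) for v in image_vecs]
    total = 0.0
    for i, text in enumerate(texts):
        sims = [dot(text, image) / temperature for image in images]
        total += -log_softmax(sims)[i]
    return total / len(texts)

def combined_loss(logits, target, text_vecs, image_vecs, weight=1.0):
    """Language modeling loss plus a weighted grounding term.

    `weight` is a hypothetical mixing coefficient; setting it to 0
    recovers a language-only objective.
    """
    lm_term = next_token_loss(logits, target)
    grounding_term = contrastive_loss(text_vecs, image_vecs)
    return lm_term + weight * grounding_term
```

In this sketch, correctly paired text and image vectors give a contrastive term near zero, while mismatched pairs are penalized heavily, so the grounding term pushes the lexical representations toward the visual ones without replacing the language modeling signal.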