The ability to connect language units to their referents in the physical world, referred to as grounding, is crucial to learning and understanding the grounded meanings of words. While humans demonstrate fast mapping in new word learning, it remains unclear whether modern vision-language models can truly represent language with its grounded meanings, and how grounding may further bootstrap new word learning. To this end, we introduce Grounded Open Vocabulary Acquisition (GOVA) to examine grounding and bootstrapping in open-world language learning. As an initial attempt, we propose object-oriented BERT (OctoBERT), a novel visually grounded language model pre-trained on image-text pairs with an objective that highlights grounding. Through extensive experiments and analysis, we demonstrate that OctoBERT is a more coherent and faster grounded word learner, and that the grounding ability acquired during pre-training helps the model learn unseen words more rapidly and robustly. Our code is available at https://github.com/sled-group/world-to-words.