Interpreting a seemingly simple function word like "or", "behind", or "more" can require logical, numerical, and relational reasoning. How do children learn such words? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural-network-based visual question answering models can apparently learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learned by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of the logical connectives "and" and "or" without any prior knowledge of logical reasoning, and we find early evidence that they are sensitive to alternative expressions when interpreting language. Finally, we show that word learning difficulty depends on the frequency of the words in the models' input. Our findings offer proof-of-concept evidence that it is possible to learn the nuanced interpretations of function words in a visually grounded context by using non-symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning.