Interpreting a seemingly simple function word like "or", "behind", or "more" can require logical, numerical, and relational reasoning. How do children learn such words? Prior acquisition theories have often relied on positing a foundation of innate knowledge. Yet recent neural-network-based visual question answering models apparently can learn to use function words as part of answering questions about complex visual scenes. In this paper, we study what these models learn about function words, in the hope of better understanding how the meanings of these words can be learned by both models and children. We show that recurrent models trained on visually grounded language learn gradient semantics for function words requiring spatial and numerical reasoning. Furthermore, we find that these models can learn the meanings of the logical connectives "and" and "or" without any prior knowledge of logical reasoning, and we find early evidence that they can develop the ability to reason about alternative expressions when interpreting language. Finally, we show that the difficulty of learning a word depends on its frequency in the models' input. Our findings offer evidence that it is possible to learn the meanings of function words in a visually grounded context by using non-symbolic general statistical learning algorithms, without any prior knowledge of linguistic meaning.