Neural language models (LMs) are arguably less data-efficient than humans. Why does this gap occur? In this study, we hypothesize that the gap stems from learners' access to modalities other than text, specifically vision. We conducted two complementary experiments, one using noisy, realistic data and one using simplified, artificial data, to investigate the advantage of vision in the syntactic generalization of LMs. Our results showed that vision accelerated proper linguistic generalization in the simplified, artificial setting, but LMs struggled in the noisy, realistic setting. These mixed results suggest several possibilities; for example, vision can potentially boost language acquisition, but learners may need additional visual or linguistic prior knowledge to robustly make use of raw images for efficient language acquisition.