Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in contexts well suited to word learning, such as dyadic play sessions, caregivers' utterances are sparse and ambiguous, often referring to objects other than the one the child is attending to. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. To this end, we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of egocentric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We propose to model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations supporting improved category recognition. Our analysis reveals that a small decrease or increase in object-relevant naming frequencies can drastically impact the learned representations. This affects the attention paid to object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.
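The two alignment objectives named above can be illustrated with a minimal sketch. The abstract does not specify the loss, so everything here is an assumption made for illustration: the InfoNCE-style contrastive form, the `temperature` and `weight` parameters, the use of `None` to mark moments without an utterance, and the function names `info_nce` and `joint_loss` are all hypothetical.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """One-directional InfoNCE loss (assumed form, not taken from the source).

    positives[i] is the match for anchors[i]; every other row of
    `positives` serves as a negative for that anchor.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Cosine similarity of anchor i with every candidate, scaled by temperature.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true match
    return total / len(a)


def joint_loss(img_t, img_next, utterances, weight=1.0):
    """Sum of a temporal image-image term and an image-utterance term.

    utterances[i] is an embedding, or None when no utterance co-occurs
    with image i (a hypothetical way to represent sparse speech).
    """
    temporal = info_nce(img_t, img_next)  # 1) close-in-time images
    pairs = [(img, utt) for img, utt in zip(img_t, utterances) if utt is not None]
    if len(pairs) < 2:
        # Too few utterances to form negatives: fall back to the temporal term.
        return temporal
    imgs, utts = zip(*pairs)
    return temporal + weight * info_nce(imgs, utts)  # 2) co-occurring pairs
```

In this sketch, sparse utterances simply shrink the batch used for the second term, so the temporal term is always active while the cross-modal term contributes only when speech is present.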