Caregiver Talk Shapes Toddler Vision: A Computational Study of Dyadic Play

Infants' ability to recognize and categorize objects develops gradually. The second year of life is marked by both the emergence of more semantic visual representations and a better understanding of word meaning. This suggests that language input may play an important role in shaping visual representations. However, even in contexts well suited to word learning, such as dyadic play sessions, caregivers' utterances are sparse and ambiguous, and often refer to objects other than the one the child is attending to. Here, we systematically investigate to what extent caregivers' utterances can nevertheless enhance visual representations. To this end, we propose a computational model of visual representation learning during dyadic play. We introduce a synthetic dataset of egocentric images perceived by a toddler-agent that moves and rotates toy objects in different parts of its home environment while hearing caregivers' utterances, modeled as captions. We model toddlers' learning as simultaneously aligning representations for 1) close-in-time images and 2) co-occurring images and utterances. We show that utterances with statistics matching those of real caregivers give rise to representations that support improved category recognition. Our analysis reveals that a small decrease or increase in the frequency of object-relevant naming can drastically impact the learned representations. Such changes affect the attention paid to object names within an utterance, which is required for efficient visuo-linguistic alignment. Overall, our results support the hypothesis that caregivers' naming utterances can improve toddlers' visual representations.
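The two alignment terms described above can be illustrated with a minimal sketch. The abstract does not specify the loss, so an InfoNCE-style contrastive objective is assumed here; the function names (`info_nce`, `dyadic_play_loss`), the temperature, the weighting, and the `has_utt` mask are hypothetical choices made for illustration, not the authors' implementation. Embeddings are plain lists of floats so that the example runs with the standard library only.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss (assumed objective, not taken from the paper).

    positives[i] is the positive for anchors[i]; every other row of
    `positives` serves as a negative for that anchor.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(a)


def dyadic_play_loss(img_t, img_next, utt, has_utt, weight=1.0, temperature=0.1):
    """Combine the two alignment terms named in the abstract.

    1) close-in-time images: img_t[i] is aligned with img_next[i];
    2) co-occurring images and utterances: img_t[i] is aligned with utt[i],
       but only where has_utt[i] is True, since utterances are sparse.
    `weight` is a hypothetical trade-off between the two terms.
    """
    temporal = info_nce(img_t, img_next, temperature)
    idx = [i for i, heard in enumerate(has_utt) if heard]
    if not idx:  # no utterance in this batch: only the temporal term applies
        return temporal
    cross_modal = info_nce([img_t[i] for i in idx], [utt[i] for i in idx], temperature)
    return temporal + weight * cross_modal
```

In this sketch, the sparsity of caregivers' speech is represented by the `has_utt` mask: time steps without an utterance contribute only to the temporal term. The cross-modal term is computed in the image-to-utterance direction only; a symmetric variant would add the reverse direction.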
