Humans are able to fuse information from the auditory and visual modalities to aid speech understanding. This is frequently demonstrated through a phenomenon known as the McGurk effect, in which a listener presented with incongruent auditory and visual speech perceives an illusory intermediate phoneme. Building on a recent framework that proposes how to address developmental 'why' questions using artificial neural networks (ANNs), we evaluated a set of recent ANNs trained on audiovisual speech by testing them with audiovisually incongruent words designed to elicit the McGurk effect. Comparing networks trained on clean speech with those trained on noisy speech, we found that training with noisy speech increased both visual responses and McGurk responses across all models. Furthermore, systematically increasing the level of auditory noise during ANN training increased audiovisual integration up to a point, but at extreme noise levels this integration failed to develop. These results suggest that excessive noise exposure during critical periods of audiovisual learning may negatively influence the development of audiovisual speech integration. This work also demonstrates that the McGurk effect reliably emerges, without explicit training, from the behaviour of both supervised and unsupervised networks, supporting the notion that artificial neural networks may be useful models for certain aspects of perception and cognition.