Natural and artificial audition can in principle acquire different solutions to a given problem. The constraints of the task, however, can nudge the cognitive science and engineering of audition to qualitatively converge, suggesting that a closer mutual examination could enrich both artificial hearing systems and process models of the mind and brain. Speech recognition, an area ripe for such exploration, is inherently robust in humans to a number of transformations at various spectrotemporal granularities. To what extent are these robustness profiles accounted for by high-performing neural network systems? We bring together experiments in speech recognition under a single synthesis framework to evaluate state-of-the-art neural networks as stimulus-computable, optimized observers. In a series of experiments, we (1) clarify how influential speech manipulations in the literature relate to each other and to natural speech, (2) show the granularities at which machines exhibit out-of-distribution robustness, reproducing classical perceptual phenomena in humans, (3) identify the specific conditions where model predictions of human performance differ, and (4) demonstrate a crucial failure of all artificial systems to perceptually recover where humans do, suggesting alternative directions for theory and model building. These findings encourage a tighter synergy between the cognitive science and engineering of audition.