Speaker identity plays a significant role in human communication and is increasingly used in societal applications, many of them enabled by advances in machine learning. Speaker identity perception is an essential cognitive phenomenon that can be broadly reduced to two main tasks: recognizing a voice and discriminating between voices. Several studies have attempted to identify acoustic correlates of identity perception in order to pinpoint the parameters salient for these tasks. Unlike studies of other communicative social signals, most of these efforts have yielded inconclusive results. Furthermore, current neurocognitive models of voice identity processing take acoustic dimensions such as fundamental frequency, harmonics-to-noise ratio, and formant dispersion as the bases of perception. However, these findings do not account for naturalistic speech or within-speaker variability. The representational spaces of current self-supervised models have shown strong performance on a variety of speech-related tasks. In this work, we demonstrate that self-supervised representations from different families (e.g., generative, contrastive, and predictive models) are significantly better than acoustic representations for speaker identification. We also show that such a speaker identification task can be used to better understand how acoustic information is represented in different layers of these networks. By evaluating speaker identification accuracy across acoustic, phonemic, prosodic, and linguistic variants, we report similarities between model performance and human identity perception. We examine these similarities further by juxtaposing the encoding spaces of models and humans and by challenging the use of distance metrics as a proxy for speaker proximity. Lastly, we show that some models can predict brain responses in auditory and language regions during naturalistic stimulus presentation.