Evaluating Speaker Identity Coding in Self-supervised Models and Humans

Speaker identity plays a significant role in human communication and is increasingly used in societal applications, many of them enabled by advances in machine learning. Speaker identity perception is an essential cognitive phenomenon that can be broadly reduced to two main tasks: recognizing a voice or discriminating between voices. Several studies have attempted to identify acoustic correlates of identity perception in order to pinpoint the salient parameters for such tasks. Unlike for other communicative social signals, most of these efforts have yielded inconclusive results. Furthermore, current neurocognitive models of voice identity processing treat acoustic dimensions such as fundamental frequency, harmonics-to-noise ratio, and formant dispersion as the bases of perception. However, these findings do not account for naturalistic speech and within-speaker variability. The representational spaces of current self-supervised models have shown strong performance in various speech-related tasks. In this work, we demonstrate that self-supervised representations from different families (e.g., generative, contrastive, and predictive models) are significantly better for speaker identification than acoustic representations. We also show that such a speaker identification task can be used to better understand the nature of acoustic information representation in different layers of these powerful networks. By evaluating speaker identification accuracy across acoustic, phonemic, prosodic, and linguistic variants, we report similarities between model performance and human identity perception. We further examine these similarities by juxtaposing the encoding spaces of models and humans and by challenging the use of distance metrics as a proxy for speaker proximity. Lastly, we show that some models can predict brain responses in auditory and language regions to naturalistic stimuli.
