Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features

While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is unsatisfactory how little is understood about what exactly is responsible for these results. Part of the success has been attributed in prior work to their capability to model supra-segmental temporal information (SST), i.e., to learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force such networks to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing the explainability of deep learning for speech technologies.
