Deep Neural Networks for Automatic Speaker Recognition Do Not Learn Supra-Segmental Temporal Features

While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is unsatisfactory how little is understood about what exactly is responsible for these results. Prior work has attributed part of this success to their capability to model supra-segmental temporal information (SST), i.e., to learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force such networks to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.


