While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is unsatisfactory how little is understood about what exactly is responsible for these results. Prior work has attributed part of this success to their capability to model supra-segmental temporal information (SST), i.e., to learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means to force such networks to focus more on SST, and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing the explainability of deep learning for speech technologies.