While deep neural networks have shown impressive results in automatic speaker recognition and related tasks, it is unsatisfactory how little is understood about what exactly is responsible for these results. Prior work has attributed part of this success to their capability to model supra-segmental temporal information (SST), i.e., to learn rhythmic-prosodic characteristics of speech in addition to spectral features. In this paper, we (i) present and apply a novel test to quantify to what extent the performance of state-of-the-art neural networks for speaker recognition can be explained by modeling SST; and (ii) present several means of forcing the respective networks to focus more on SST and evaluate their merits. We find that a variety of CNN- and RNN-based neural network architectures for speaker recognition do not model SST to any sufficient degree, even when forced. The results provide a highly relevant basis for impactful future research into better exploitation of the full speech signal and give insights into the inner workings of such networks, enhancing explainability of deep learning for speech technologies.