Despite their success in speech processing, neural networks often operate as black boxes, prompting the question: what informs their decisions, and how can we interpret them? This work examines this issue in the context of lexical stress. A dataset of English disyllabic words was automatically constructed from read and spontaneous speech. Several Convolutional Neural Network (CNN) architectures were trained to predict stress position from a spectrographic representation of disyllabic words lacking minimal stress pairs (e.g., initial stress WAllet, final stress exTEND), achieving up to 92% accuracy on held-out test data. Layerwise Relevance Propagation (LRP), a technique for CNN interpretability analysis, revealed that predictions for held-out minimal pairs (PROtest vs. proTEST) were most strongly influenced by information in stressed versus unstressed syllables, particularly the spectral properties of stressed vowels. However, the classifiers also attended to information throughout the word. A feature-specific relevance analysis is proposed, and its results suggest that our best-performing classifier is strongly influenced by the stressed vowel's first and second formants, with some evidence that its pitch and third formant also contribute. These results reveal deep learning's ability to acquire distributed cues to stress from naturally occurring data, extending traditional phonetic work based on highly controlled stimuli.
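The pipeline described above can be sketched in miniature. The code below is a hypothetical illustration, not the paper's actual architecture: a small CNN classifying a (frequency × time) spectrogram as initial- vs. final-stress, with a gradient-times-input relevance map standing in for full Layerwise Relevance Propagation (which requires layer-specific backward rules).

```python
# Minimal sketch (hypothetical architecture, not the paper's): a small CNN
# that classifies a spectrogram as initial vs. final stress, plus a
# gradient-x-input attribution map -- a simpler stand-in for LRP.
import torch
import torch.nn as nn

class StressCNN(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(8, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(16, 2)  # initial vs. final stress

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def relevance_map(model, spec, target):
    """Gradient-x-input attribution over the input spectrogram."""
    spec = spec.clone().requires_grad_(True)
    model(spec)[0, target].backward()
    # Elementwise product of gradient and input: a per-pixel relevance score.
    return (spec.grad * spec).squeeze(0).squeeze(0)

model = StressCNN().eval()
spec = torch.randn(1, 1, 64, 100)  # dummy spectrogram: 64 freq bins, 100 frames
rel = relevance_map(model, spec, target=0)
print(rel.shape)  # relevance map has the same shape as the spectrogram
```

Because the relevance map is aligned with the spectrogram's time–frequency grid, relevance can be summed over regions corresponding to each syllable or formant band, which is the idea behind the feature-specific analysis mentioned above.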