Which phonemes convey more speaker traits is a long-standing question that has been studied in various perception experiments with human subjects. In speaker recognition, studies with conventional statistical models have reached conclusions that are largely consistent with the perception results. However, which phonemes matter most to modern deep neural models remains unexplored, owing to the opaqueness of their decision process. This paper presents a novel study of phoneme attribution for two types of deep speaker models, based on TDNN and CNN respectively, from the perspective of model explanation. Specifically, we conduct the study with two post-hoc explanation methods: LayerCAM and Time Align Occlusion (TAO). Experimental results show that: (1) At the population level, vowels are more important than consonants, confirming the human perception studies. However, fricatives are among the least important phonemes, which contrasts with previous studies. (2) At the speaker level, phoneme importance varies widely between speakers, indicating that whether a phoneme is important is largely speaker-dependent.