Deep learning architectures have achieved significant performance gains in many research areas. The automatic speech recognition (ASR) field has benefited from these scientific and technological advances, particularly for acoustic modeling, which now integrates deep neural network architectures. However, these performance gains come at the cost of increased complexity in the information learned and conveyed by these black-box architectures. Following extensive research on neural network interpretability, we propose in this article a protocol to determine which information is located where in an ASR acoustic model (AM). To do so, we evaluate AM performance on a defined set of tasks using intermediate representations taken at different layer levels. From the performance variation across these targeted tasks, we can form hypotheses about which information is enhanced or perturbed at each stage of the architecture. Experiments cover speaker verification, acoustic environment classification, gender classification, tempo-distortion detection, and speech sentiment/emotion identification. The analysis shows that neural-based AMs hold heterogeneous information that seems surprisingly uncorrelated with phoneme recognition, such as emotion, sentiment, or speaker identity. The lower hidden layers globally appear useful for structuring this information, while the upper ones tend to discard information that is useless for phoneme recognition.
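The probing protocol described above can be sketched in a few lines: extract the hidden representation after each layer of the model, then fit a simple probe classifier on an auxiliary label (e.g. gender) and compare accuracy across depths. The snippet below is a minimal, hypothetical illustration only; it uses a toy random network and synthetic labels in place of the trained ASR acoustic model and real corpora used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for an acoustic model: a stack of tanh layers with random
# weights. (A real protocol would load a trained ASR AM instead.)
def forward_with_activations(x, weights):
    """Return the hidden representation produced after each layer."""
    acts = []
    h = x
    for W in weights:
        h = np.tanh(h @ W)
        acts.append(h)
    return acts

def probe_accuracy(feats, labels):
    """Fit a linear probe (least-squares regression to one-hot targets)
    and report accuracy on the same data, as a rough measure of how
    linearly decodable the labels are from these features."""
    onehot = np.eye(labels.max() + 1)[labels]
    W, *_ = np.linalg.lstsq(feats, onehot, rcond=None)
    preds = (feats @ W).argmax(axis=1)
    return float((preds == labels).mean())

# Synthetic frame-level features and a synthetic binary auxiliary label.
X = rng.normal(size=(200, 40))
y = rng.integers(0, 2, size=200)
layers = [rng.normal(scale=0.5, size=(40, 40)) for _ in range(4)]

for depth, h in enumerate(forward_with_activations(X, layers), start=1):
    print(f"layer {depth}: probe accuracy = {probe_accuracy(h, y):.2f}")
```

Comparing the per-layer probe accuracies is what licenses the abstract's conclusions: if accuracy on a task rises or falls with depth, the corresponding information is being enhanced or suppressed at that stage of the architecture.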