One persistent challenge in deep-learning-based speech emotion recognition (SER) is the unintended encoding of emotion-irrelevant factors (e.g., speaker or phonetic variability), which limits how well SER models generalize in practical use. In this paper, we propose DSNet, a Disentangled Siamese Network with neutral calibration, to meet the demand for a more robust and explainable SER model. Specifically, we introduce an orthogonal feature disentanglement module that explicitly projects the high-level representation into two distinct subspaces. We then propose a novel neutral calibration mechanism that encourages one subspace to capture sufficient emotion-irrelevant information, so that the other subspace can better isolate and emphasize the emotion-relevant information within speech signals. Experimental results on two popular benchmark datasets demonstrate that DSNet outperforms various state-of-the-art methods for speaker-independent SER.
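To make the idea of orthogonal feature disentanglement concrete, the following is a minimal sketch rather than the DSNet implementation. It assumes a soft orthogonality constraint of a commonly used form, the squared Frobenius norm of the cross-product between the two subspace representations; the abstract does not state the actual loss. The names `matmul`, `transpose`, and `orthogonality_penalty`, and the toy matrices, are hypothetical and introduced only for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(m):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*m)]


def orthogonality_penalty(z_a, z_b):
    """Squared Frobenius norm of z_a^T z_b.

    z_a and z_b are (batch x dim) matrices holding the two subspace
    representations. This is an assumed soft constraint, not
    necessarily the loss used by DSNet. It is zero when every feature
    column of z_a is orthogonal to every feature column of z_b.
    """
    cross = matmul(transpose(z_a), z_b)
    return sum(v * v for row in cross for v in row)


# Toy high-level representation: 2 utterances, 4 features (made-up values).
h = [[1.0, 2.0, 3.0, 4.0],
     [0.5, 1.5, 2.5, 3.5]]

# Hypothetical projection matrices (4 x 2) that select disjoint features.
w_emotion = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
w_other = [[0.0, 0.0], [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

z_emotion = matmul(h, w_emotion)  # candidate emotion-relevant subspace
z_other = matmul(h, w_other)      # candidate emotion-irrelevant subspace

print(orthogonality_penalty(z_emotion, z_other))
```

In a training setup, a penalty of this kind would typically be added to the classification loss so that the two projections are pushed toward carrying non-overlapping information. The neutral calibration mechanism is not sketched here because the abstract does not describe how it is computed.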