The goal of Speech Emotion Recognition (SER) is to enable computers to recognize the emotion category of a given utterance in the same way that humans do. The accuracy of SER depends strongly on the validity of the utterance-level representation obtained by the model. Nevertheless, previous studies have largely ignored the "dark knowledge" carried by non-target classes. In this paper, we propose a hierarchical network, called DKDFMH, which employs decoupled knowledge distillation in a deep convolutional neural network with a fused multi-head attention mechanism. Our approach applies logit distillation to obtain higher-level semantic features from attention sets at different scales and delves into the knowledge carried by non-target classes, thus guiding the model to focus more on the differences between sentiment features. To validate the effectiveness of our model, we conducted experiments on the Interactive Emotional Dyadic Motion Capture (IEMOCAP) dataset. We achieved competitive performance, with 79.1% weighted accuracy (WA) and 77.1% unweighted accuracy (UA). To the best of our knowledge, this is the first time since 2015 that logit distillation has returned to state-of-the-art status.
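To make the role of non-target classes concrete, the following is a minimal sketch of a decoupled logit-distillation loss for a single utterance, written in plain Python. It follows the commonly used decomposition of the distillation loss into a target-class term (TCKD) and a non-target-class term (NCKD); it is not the DKDFMH implementation. The function names, the weights `alpha` and `beta`, the `temperature` value, and the four-class logits in the usage lines are all assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


def decoupled_kd_loss(student_logits, teacher_logits, target,
                      alpha=1.0, beta=8.0, temperature=4.0):
    """Hypothetical decoupled distillation loss for one sample.

    alpha, beta, and temperature are illustrative defaults (assumptions),
    not values reported in the abstract.
    """
    p_student = softmax(student_logits, temperature)
    p_teacher = softmax(teacher_logits, temperature)

    # TCKD: binary distribution (target class vs. all other classes).
    other_s = sum(p for i, p in enumerate(p_student) if i != target)
    other_t = sum(p for i, p in enumerate(p_teacher) if i != target)
    tckd = kl_divergence([p_teacher[target], other_t],
                         [p_student[target], other_s])

    # NCKD: distribution among non-target classes only. A softmax over the
    # non-target logits equals the renormalized non-target probabilities.
    rest_s = [z for i, z in enumerate(student_logits) if i != target]
    rest_t = [z for i, z in enumerate(teacher_logits) if i != target]
    nckd = kl_divergence(softmax(rest_t, temperature),
                         softmax(rest_s, temperature))

    # The temperature**2 factor is the usual gradient-scale correction.
    return (alpha * tckd + beta * nckd) * temperature ** 2


# Usage with made-up logits for four hypothetical emotion classes.
student = [2.0, 1.0, 0.0, -1.0]
teacher = [2.0, 0.0, 1.0, -1.0]
loss = decoupled_kd_loss(student, teacher, target=0)
```

Weighting the two terms separately lets the non-target term contribute even when the teacher is very confident about the target class, which is one way to expose the "dark knowledge" the abstract refers to.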