Speech Emotion Recognition (SER) aims to recognize human emotions during natural verbal interaction with machines. It is considered a challenging problem because human emotions are ambiguous. Despite recent progress in SER, state-of-the-art models still struggle to achieve satisfactory performance. We propose a self-attention based method that combines label-adaptive mixup with center loss. By adapting label probabilities in mixup and fitting center loss to the mixup training scheme, the proposed method outperforms state-of-the-art methods.
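The two ingredients named above can be illustrated with a minimal sketch. It is not the implementation behind the reported results: the abstract does not specify how label probabilities are adapted or how center loss is fitted to mixup, so the function names `mixup` and `soft_center_loss`, the separate `label_lam` weight, and the probability-weighted form of the center loss are all assumptions made for illustration.

```python
def mixup(x_a, x_b, y_a, y_b, lam, label_lam=None):
    """Mix two feature vectors and their one-hot labels.

    lam weights the inputs. label_lam is a hypothetical separate weight
    for the labels; when omitted it equals lam, which is standard mixup.
    """
    if label_lam is None:
        label_lam = lam
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [label_lam * a + (1.0 - label_lam) * b for a, b in zip(y_a, y_b)]
    return x, y


def soft_center_loss(feature, soft_label, centers):
    """Center loss for one sample with a soft (mixed) label.

    Assumed form: half the squared distance to each class center,
    weighted by that class's probability in the mixed label.
    """
    total = 0.0
    for prob, center in zip(soft_label, centers):
        sq_dist = sum((f - c) ** 2 for f, c in zip(feature, center))
        total += prob * sq_dist
    return 0.5 * total
```

With a one-hot label, `soft_center_loss` reduces to the usual center loss for the labeled class, so the soft-label form is one plausible way to keep the loss defined when mixup produces labels that spread over two classes.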