Speech emotion recognition (SER) in spontaneous conversations has received a great deal of attention in recent years. Notable results have been reported on datasets such as IEMOCAP, the well-known corpus of naturalistic dyadic conversations, for both categorical and dimensional emotions, yet few papers attempt to predict both paradigms at the same time. In this work, we therefore aim to highlight the performance contribution of multi-task learning by proposing a multi-task, multi-modal system that predicts categorical and dimensional emotions. The results emphasise the importance of cross-regularisation between the two types of emotions. Our architecture refines the features of each modality in parallel through self-attention. To fuse them, the model introduces a set of learnable bridge tokens that merge the acoustic and linguistic features through cross-attention. Our experiments on categorical emotions with 10-fold cross-validation yield results comparable to the current state of the art. In our configuration, the multi-task approach gives better results than learning each paradigm separately. In addition, our best-performing model achieves a high valence result relative to previous multi-task experiments.
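The pipeline described above (per-modality self-attention, bridge-token fusion through cross-attention, and two prediction heads) can be illustrated with a minimal, dependency-free sketch. It is not the model from this work: the abstract does not specify the attention projections, the number of bridge tokens, the pooling step, the number of emotion classes, the set of dimensional targets, or the loss. The sketch assumes single-head attention without learned projections, bridge tokens acting as queries over the concatenated modalities, mean pooling, four emotion classes, three dimensional targets (valence, arousal, dominance), and a cross-entropy plus weighted MSE objective. All names (`attention`, `forward`, `multitask_loss`, `W_cat`, `W_dim`) are hypothetical.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attention(queries, keys_values):
    # Single-head scaled dot-product attention with no learned projections
    # (assumption: a real model would project queries, keys and values).
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys_values])
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(len(q))])
    return out


def linear(x, weight, bias):
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(acoustic, linguistic, bridge, params):
    # Parallel feature refinement: each modality attends to itself.
    acoustic = attention(acoustic, acoustic)
    linguistic = attention(linguistic, linguistic)
    # Fusion (assumption): bridge tokens are the queries and attend over
    # the concatenated acoustic and linguistic sequences.
    fused = attention(bridge, acoustic + linguistic)
    # Mean-pool the bridge tokens into one utterance vector (assumption).
    pooled = [sum(col) / len(fused) for col in zip(*fused)]
    logits = linear(pooled, params["W_cat"], params["b_cat"])  # categorical head
    dims = linear(pooled, params["W_dim"], params["b_dim"])    # dimensional head
    return logits, dims


def multitask_loss(logits, dims, label, target_dims, weight=1.0):
    # Joint objective (assumption): cross-entropy plus weighted MSE.
    ce = -math.log(softmax(logits)[label])
    mse = sum((p - t) ** 2 for p, t in zip(dims, target_dims)) / len(dims)
    return ce + weight * mse


rng = random.Random(0)
D = 8                                  # shared feature size (hypothetical)
acoustic = rand_matrix(5, D, rng)      # 5 acoustic frames
linguistic = rand_matrix(3, D, rng)    # 3 text tokens
bridge = rand_matrix(2, D, rng)        # 2 bridge tokens (learnable in practice)
params = {
    "W_cat": rand_matrix(4, D, rng), "b_cat": [0.0] * 4,  # 4 classes (assumption)
    "W_dim": rand_matrix(3, D, rng), "b_dim": [0.0] * 3,  # 3 dimensions (assumption)
}
logits, dims = forward(acoustic, linguistic, bridge, params)
loss = multitask_loss(logits, dims, label=1, target_dims=[0.5, 0.2, 0.1])
```

Because both heads read the same pooled bridge representation, gradients from the categorical and dimensional losses would update the same fused features, which is one plausible reading of the cross-regularisation effect mentioned above.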