In this paper, we propose a novel deep inductive transfer learning framework, named the feature distribution adaptation network (FDAN), to tackle the challenging problem of multi-modal speech emotion recognition. Our method uses deep transfer learning strategies to align the visual and audio feature distributions and obtain a consistent representation of emotion, thereby improving speech emotion recognition performance. In our model, pre-trained ResNet-34 networks extract features from facial expression images and acoustic Mel spectrograms, respectively. A cross-attention mechanism is then introduced to model the intrinsic similarity relationships between the multi-modal features. Finally, multi-modal feature distribution adaptation is performed efficiently with a feed-forward network, which is extended with the local maximum mean discrepancy (LMMD) loss. Experiments on two benchmark datasets demonstrate that our model achieves excellent performance compared with existing models. Our code is available at https://github.com/shaokai1209/FDAN.
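The cross-attention step described above can be illustrated with a minimal sketch. This is not the paper's implementation (which operates on ResNet-34 feature maps inside a larger network); it is a standalone scaled dot-product cross-attention in numpy, where one modality's features serve as queries and the other modality's features serve as keys and values. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query_feats, key_value_feats):
    """Scaled dot-product cross-attention between two modalities.

    query_feats:     (n_q, d) features of modality A (e.g. visual).
    key_value_feats: (n_kv, d) features of modality B (e.g. audio).
    Returns an (n_q, d) array: modality A re-expressed as a weighted
    combination of modality B's features, so that similar emotional
    content in the two streams is aligned.
    """
    d = query_feats.shape[-1]
    scores = query_feats @ key_value_feats.T / np.sqrt(d)  # (n_q, n_kv) similarities
    weights = softmax(scores, axis=-1)                     # rows sum to 1
    return weights @ key_value_feats                       # (n_q, d) fused features

# Example: 2 visual feature vectors attend over 4 audio feature vectors.
visual = np.random.randn(2, 8)
audio = np.random.randn(4, 8)
fused = cross_attention(visual, audio)   # shape (2, 8)
```

In the full model this fusion would typically be applied in both directions (visual queries audio and audio queries visual) before the feed-forward adaptation layer.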