This paper presents a novel approach to processing multimodal data for dynamic emotion recognition, named the Multimodal Masked Autoencoder for Dynamic Emotion Recognition (MultiMAE-DER). MultiMAE-DER leverages the closely correlated representation information within spatiotemporal sequences across the visual and audio modalities. Because it builds on a pre-trained masked autoencoder, MultiMAE-DER is obtained through simple, straightforward fine-tuning. Its performance is further improved by optimizing six fusion strategies for multimodal input sequences. These strategies address dynamic feature correlations within cross-domain data across spatial, temporal, and spatiotemporal sequences. Compared with state-of-the-art multimodal supervised learning models for dynamic emotion recognition, MultiMAE-DER improves the weighted average recall (WAR) by 4.41% on the RAVDESS dataset and by 2.06% on the CREMA-D dataset. Furthermore, compared with the state-of-the-art multimodal self-supervised learning model, MultiMAE-DER achieves a 1.86% higher WAR on the IEMOCAP dataset.
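For readers unfamiliar with the metric, WAR is commonly computed as the per-class recall averaged with weights equal to each class's share of the true labels, which makes it equal to overall accuracy. The following is a minimal sketch under that common definition, not the authors' evaluation code; the function name and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_average_recall(y_true, y_pred):
    """Per-class recall, weighted by each class's share of the true labels."""
    support = Counter(y_true)  # number of true samples per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct predictions per class
    total = len(y_true)
    return sum((support[c] / total) * (hits[c] / support[c]) for c in support)


# Hypothetical toy labels, for illustration only (not taken from any dataset).
y_true = ["happy", "happy", "happy", "sad", "angry", "angry"]
y_pred = ["happy", "happy", "sad", "sad", "angry", "happy"]

war = weighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Under this definition the class weights cancel the per-class denominators, so `war` and `accuracy` coincide; the unweighted variant (UAR) would instead average the per-class recalls equally.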