Systems for multimodal emotion recognition (MMER) can typically outperform unimodal systems by leveraging the inter- and intra-modal relationships between, e.g., visual, textual, physiological, and auditory modalities. In this paper, an MMER method is proposed that relies on a joint multimodal transformer for fusion with key-based cross-attention. This framework aims to exploit the diverse and complementary nature of different modalities to improve predictive accuracy. Separate backbones capture intra-modal spatiotemporal dependencies within each modality over video sequences. Subsequently, a joint multimodal transformer fusion architecture integrates the individual modality embeddings, allowing the model to effectively capture both inter- and intra-modal relationships. Extensive experiments are conducted on two challenging expression recognition tasks: (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice), and (2) pain estimation on the Biovid dataset (with face and biosensors). Results indicate that the proposed method works effectively with different modalities, and that MMER systems equipped with our proposed fusion method outperform relevant baseline and state-of-the-art methods.
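As a rough illustration of this kind of fusion scheme, the following PyTorch sketch pairs symmetric cross-attention (each modality's queries attend over the other modality's keys and values) with a joint transformer encoder over the concatenated sequences. All module names, dimensions, the exact cross-attention wiring, and the regression head are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch (not the paper's code) of joint multimodal transformer
# fusion with cross-attention, assuming two modality backbones that each emit
# a (batch, time, dim) sequence of embeddings.
import torch
import torch.nn as nn


class CrossModalAttention(nn.Module):
    """One modality's queries attend over another modality's keys/values."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_seq: torch.Tensor, context_seq: torch.Tensor) -> torch.Tensor:
        # Residual connection around cross-attention, then layer norm.
        attended, _ = self.attn(query_seq, context_seq, context_seq)
        return self.norm(query_seq + attended)


class JointMultimodalFusion(nn.Module):
    """Symmetric cross-attention between two modalities, followed by a joint
    transformer encoder over the concatenated sequence (assumed design)."""

    def __init__(self, dim: int = 128, num_heads: int = 4, num_layers: int = 2):
        super().__init__()
        self.a_attends_b = CrossModalAttention(dim, num_heads)
        self.b_attends_a = CrossModalAttention(dim, num_heads)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=num_heads, batch_first=True)
        self.joint_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.head = nn.Linear(dim, 1)  # e.g. one regression target (valence or pain)

    def forward(self, mod_a: torch.Tensor, mod_b: torch.Tensor) -> torch.Tensor:
        # mod_a, mod_b: (batch, time, dim) outputs of per-modality backbones.
        a = self.a_attends_b(mod_a, mod_b)   # inter-modal: A queries B
        b = self.b_attends_a(mod_b, mod_a)   # inter-modal: B queries A
        joint = torch.cat([a, b], dim=1)     # joint sequence over both modalities
        joint = self.joint_encoder(joint)    # self-attention across the joint sequence
        return self.head(joint.mean(dim=1))  # temporal pooling, then prediction


if __name__ == "__main__":
    face = torch.randn(2, 16, 128)   # e.g. visual backbone embeddings
    voice = torch.randn(2, 16, 128)  # e.g. audio backbone embeddings
    print(JointMultimodalFusion()(face, voice).shape)  # torch.Size([2, 1])
```

Concatenating the cross-attended sequences before the joint encoder lets the self-attention layers model intra- and inter-modal dependencies in a single pass; the mean-pooled head stands in for whatever task-specific output layer the target dataset requires.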