Multimodal emotion recognition (MMER) systems typically outperform unimodal systems by leveraging inter- and intra-modal relationships across modalities such as visual, textual, physiological, and auditory signals. This paper proposes an MMER method that fuses modalities with a joint multimodal transformer (JMT) using key-based cross-attention. This framework exploits the complementary nature of diverse modalities to improve predictive accuracy. Separate backbones capture the spatiotemporal dependencies within each modality over video sequences. Our JMT fusion architecture then integrates the individual modality embeddings, allowing the model to capture both inter- and intra-modal relationships. Extensive experiments on two challenging expression recognition tasks -- (1) dimensional emotion recognition on the Affwild2 dataset (with face and voice) and (2) pain estimation on the Biovid dataset (with face and biosensors) -- indicate that our JMT fusion can provide a cost-effective solution for MMER. Empirical results show that MMER systems using the proposed fusion outperform relevant baselines and state-of-the-art methods.
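To make the idea of cross-attention fusion concrete, the following is a minimal sketch of generic scaled dot-product cross-attention between two modality embedding sequences. It is not the JMT architecture itself: the abstract does not specify the layer design, so the function names (`cross_attention`, `fuse`), the absence of learned projections, the concatenation step, and the toy face and voice embeddings are all assumptions made for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over the keys/values
    of another modality. No learned projections (illustrative assumption)."""
    d = len(keys[0])
    attended = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return attended


def fuse(mod_a, mod_b):
    """Hypothetical fusion: each modality attends to the other, and the two
    attended sequences are concatenated per time step."""
    a_att = cross_attention(mod_a, mod_b, mod_b)  # queries from A, keys/values from B
    b_att = cross_attention(mod_b, mod_a, mod_a)  # queries from B, keys/values from A
    return [a + b for a, b in zip(a_att, b_att)]


# Toy embeddings: 2 time steps, 2 features per modality (made-up values).
face = [[1.0, 0.0], [0.0, 1.0]]
voice = [[0.5, 0.5], [1.0, -1.0]]
fused = fuse(face, voice)
```

In this sketch each fused time step has twice the feature dimension of a single modality, since the two attended sequences are simply concatenated; a practical system would typically add learned query, key, and value projections and feed the result to a prediction head.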