Multimodal emotion recognition (MER) is a fundamentally complex research problem due to the uncertainty of human emotional expression and the heterogeneity gap between different modalities. The audio and text modalities are particularly important to how a human participant understands emotions. Although many successful attempts have been made to design multimodal representations for MER, multiple challenges remain to be addressed: 1) bridging the heterogeneity gap between multimodal features and modelling the inter- and intra-modal interactions of multiple modalities; 2) effectively and efficiently modelling the contextual dynamics in the conversation sequence. In this paper, we propose the Cross-Modal RoBERTa (CM-RoBERTa) model for emotion detection from spoken audio and the corresponding transcripts. As the core unit of CM-RoBERTa, parallel self- and cross-attention is designed to dynamically capture the inter- and intra-modal interactions of audio and text. Specifically, mid-level fusion and a residual module are employed to model long-term contextual dependencies and learn modality-specific patterns. We evaluate the approach on the MELD dataset, and the experimental results show that the proposed approach achieves state-of-the-art performance on that dataset.
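To make the idea of parallel self- and cross-attention concrete, the following is a minimal sketch in plain Python, not the CM-RoBERTa implementation. It assumes scaled dot-product attention without learned projections, and it assumes the two branches are combined with the input by a residual sum; the abstract does not specify how the branches are merged, so that choice, the function names (`attention`, `parallel_self_cross`), and the toy feature values are all hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def parallel_self_cross(target, source):
    """Run self-attention on `target` and cross-attention from `target`
    to `source` side by side, then add both to the input (residual sum).

    The residual sum is an assumption made for illustration only.
    """
    self_out = attention(target, target, target)   # intra-modal branch
    cross_out = attention(target, source, source)  # inter-modal branch
    return [[t + s + c for t, s, c in zip(t_row, s_row, c_row)]
            for t_row, s_row, c_row in zip(target, self_out, cross_out)]


# Toy features (hypothetical): 3 text tokens and 4 audio frames, width 2.
text = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.2, 0.8]]

text_fused = parallel_self_cross(text, audio)   # text attends to audio
audio_fused = parallel_self_cross(audio, text)  # audio attends to text
```

Each fused sequence keeps the length of its own modality (three text positions, four audio positions), so the two streams can differ in length while still exchanging information. A trained model would add learned query, key, and value projections, multiple heads, and normalization.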