AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations

Analyzing individual emotions during group conversations is crucial for developing intelligent agents capable of natural human-machine interaction. Although reliable emotion recognition draws on multiple modalities (text, audio, video), the inherent heterogeneity between these modalities and the dynamic cross-modal interactions shaped by each individual's unique behavioral patterns make the task challenging. The difficulty is compounded in group settings, where an emotion and its temporal evolution are influenced not only by the individual but also by external context, such as audience reaction and the ongoing conversation. To meet this challenge, we propose a Multimodal Attention Network (MAN) that captures cross-modal interactions at various levels of spatial abstraction by jointly learning an interacting set of mode-specific Peripheral and Central networks. MAN injects cross-modal attention, via Peripheral key-value pairs, into each layer of a mode-specific Central query network. The resulting cross-attended mode-specific descriptors are then combined using an Adaptive Fusion technique that enables the model to integrate the discriminative and complementary mode-specific data patterns within an instance-specific multimodal descriptor. Given a dialogue represented as a sequence of utterances, the proposed AMuSE model condenses both spatial and temporal features into two dense descriptors: speaker-level and utterance-level. This not only delivers better classification performance on large-scale public datasets (a 3-5% improvement in Weighted-F1 and a 5-7% improvement in Accuracy) but also helps users understand the reasoning behind each emotion prediction through the model's Multimodal Explainability Visualization module.
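
The abstract does not give layer-level details, so the following is only a minimal sketch of the two ideas it names: a Central stream whose queries attend over Peripheral key-value pairs, and an adaptive, instance-specific fusion of the resulting mode-specific descriptors. It assumes scaled dot-product attention with a residual connection, mean pooling over time, and a softmax gate for fusion. None of these choices are confirmed by the source, and all function names, dimensions, and random weights are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_modal_attention(central, peripheral, w_q, w_k, w_v):
    """Central features form queries; peripheral features form keys and values.

    central: (T_c, d), peripheral: (T_p, d). Returns the attended central
    features (T_c, d) and the attention map (T_c, T_p).
    Assumption: scaled dot-product attention plus a residual connection.
    """
    q = central @ w_q
    k = peripheral @ w_k
    v = peripheral @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return central + attn @ v, attn


def adaptive_fusion(descriptors, w_gate):
    """Fuse per-modality descriptors (M, d) with instance-specific weights.

    Assumption: a single linear gate followed by a softmax over modalities.
    """
    weights = softmax(descriptors @ w_gate, axis=0)  # (M,)
    return weights @ descriptors, weights            # (d,), (M,)


# Toy inputs: random features stand in for real text/audio/video encoders.
rng = np.random.default_rng(0)
d = 8
features = {
    "text": rng.normal(size=(5, d)),
    "audio": rng.normal(size=(7, d)),
    "video": rng.normal(size=(6, d)),
}

descriptors = []
attention_maps = {}
for mode, central in features.items():
    # The other two modalities play the Peripheral role for this Central stream.
    peripheral = np.concatenate([f for m, f in features.items() if m != mode])
    w_q, w_k, w_v = (rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    attended, attn = cross_modal_attention(central, peripheral, w_q, w_k, w_v)
    attention_maps[mode] = attn
    descriptors.append(attended.mean(axis=0))  # pool over time (assumption)

fused, fusion_weights = adaptive_fusion(np.stack(descriptors), rng.normal(size=d))
print(fused.shape, fusion_weights.round(3))
```

In this sketch the fusion weights depend on the descriptors themselves, so each input instance receives its own modality weighting, and the stored attention maps are the kind of intermediate quantity an explainability visualization could display.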
