GatedxLSTM: A Multimodal Affective Computing Approach for Emotion Recognition in Conversations

Affective Computing (AC) is essential for advancing Artificial General Intelligence (AGI), with emotion recognition serving as a key component. However, human emotions are inherently dynamic, influenced not only by an individual's expressions but also by interactions with others, and single-modality approaches often fail to capture their full dynamics. Multimodal Emotion Recognition (MER) leverages multiple signals but traditionally relies on utterance-level analysis, overlooking the dynamic nature of emotions in conversations. Emotion Recognition in Conversation (ERC) addresses this limitation, yet existing methods struggle to align multimodal features and explain why emotions evolve within dialogues. To bridge this gap, we propose GatedxLSTM, a novel speech-text multimodal ERC model that explicitly considers voice and transcripts of both the speaker and their conversational partner(s) to identify the most influential sentences driving emotional shifts. By integrating Contrastive Language-Audio Pretraining (CLAP) for improved cross-modal alignment and employing a gating mechanism to emphasise emotionally impactful utterances, GatedxLSTM enhances both interpretability and performance. Additionally, the Dialogical Emotion Decoder (DED) refines emotion predictions by modelling contextual dependencies. Experiments on the IEMOCAP dataset demonstrate that GatedxLSTM achieves state-of-the-art (SOTA) performance among open-source methods in four-class emotion classification. These results validate its effectiveness for ERC applications and provide an interpretability analysis from a psychological perspective.
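
The gating idea mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: the function `gate_utterances`, the parameters `gate_weights` and `gate_bias`, and the toy 3-dimensional embeddings are all hypothetical, chosen only to show how a sigmoid gate might score each utterance in a conversational window and expose the most influential one. In a real system the embeddings would presumably come from a pretrained encoder such as CLAP, and the gate parameters would be learned.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gate_utterances(
    embeddings: Sequence[Sequence[float]],
    gate_weights: Sequence[float],
    gate_bias: float = 0.0,
) -> Tuple[List[float], List[float], int]:
    """Score each utterance with a scalar sigmoid gate (illustrative only).

    Returns the per-utterance gate values, a gate-weighted average of the
    embeddings, and the index of the utterance with the largest gate.
    """
    if not embeddings:
        raise ValueError("need at least one utterance embedding")
    dim = len(gate_weights)
    gates: List[float] = []
    for emb in embeddings:
        if len(emb) != dim:
            raise ValueError("embedding and gate_weights dimensions differ")
        # Linear score followed by a sigmoid, so each gate lies in (0, 1).
        score = sum(w * x for w, x in zip(gate_weights, emb)) + gate_bias
        gates.append(sigmoid(score))

    # Gate-weighted average: utterances with larger gates contribute more.
    total = sum(gates)
    context = [
        sum(g * emb[d] for g, emb in zip(gates, embeddings)) / total
        for d in range(dim)
    ]

    # The largest gate marks the utterance treated as most influential.
    most_influential = max(range(len(gates)), key=gates.__getitem__)
    return gates, context, most_influential


# Toy conversational window of three utterances (hypothetical embeddings).
window = [[0.1, 0.0, 0.2], [0.9, 0.8, -0.1], [0.2, 0.1, 0.0]]
gates, context, top = gate_utterances(window, gate_weights=[1.0, 1.0, 0.0])
```

Because the gate values are explicit scalars, they can be inspected per utterance, which is the kind of property that makes a gating mechanism useful for interpretability. Here the second utterance receives the largest gate because its linear score is the highest of the three.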
