Emotion Recognition in Conversations (ERC) presents unique challenges, requiring models to capture the temporal flow of multi-turn dialogues and to effectively integrate cues from multiple modalities. We propose Mixture of Speech-Text Experts for Recognition of Emotions (MiSTER-E), a modular Mixture-of-Experts (MoE) framework designed to decouple two core challenges in ERC: modality-specific context modeling and multimodal information fusion. MiSTER-E leverages large language models (LLMs) fine-tuned for both speech and text to provide rich utterance-level embeddings, which are then enhanced through a convolutional-recurrent context modeling layer. The system integrates predictions from three experts (speech-only, text-only, and cross-modal) using a learned gating mechanism that dynamically weighs their outputs. To further encourage consistency and alignment across modalities, we introduce a supervised contrastive loss between paired speech-text representations and a KL-divergence-based regularization across expert predictions. Importantly, MiSTER-E does not rely on speaker identity at any stage. Experiments on three benchmark datasets (IEMOCAP, MELD, and MOSI) show that MiSTER-E achieves 70.9%, 69.5%, and 87.9% weighted F1-scores, respectively, outperforming several baseline speech-text ERC systems. We also report ablation studies that quantify the contribution of each component of the proposed approach.
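The expert-fusion idea described above can be sketched in code. The following is a minimal, hypothetical PyTorch illustration (not the authors' implementation): three expert logit streams are combined by a learned softmax gate, and a KL-divergence term pulls each expert's prediction toward the fused distribution. All module and function names, the gate's input (concatenated expert features), and the direction of the KL term are assumptions for illustration only.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedExpertFusion(nn.Module):
    """Hypothetical sketch of a learned gate over three experts
    (speech-only, text-only, cross-modal)."""

    def __init__(self, feat_dim: int, n_classes: int, n_experts: int = 3):
        super().__init__()
        # Gate scores computed from the concatenated expert features (assumed design).
        self.gate = nn.Linear(feat_dim * n_experts, n_experts)

    def forward(self, expert_logits: torch.Tensor, expert_feats: torch.Tensor):
        # expert_logits: (B, n_experts, n_classes)
        # expert_feats:  (B, n_experts, feat_dim)
        weights = F.softmax(self.gate(expert_feats.flatten(1)), dim=-1)  # (B, n_experts)
        fused = (weights.unsqueeze(-1) * expert_logits).sum(dim=1)       # (B, n_classes)
        return fused, weights

def kl_consistency(expert_logits: torch.Tensor, fused_logits: torch.Tensor) -> torch.Tensor:
    """KL(fused || expert) averaged over experts: a regularizer that
    encourages each expert to agree with the fused prediction (assumed form)."""
    p_fused = F.softmax(fused_logits, dim=-1).unsqueeze(1)   # (B, 1, n_classes)
    log_q = F.log_softmax(expert_logits, dim=-1)             # (B, n_experts, n_classes)
    return F.kl_div(log_q, p_fused.expand_as(log_q), reduction="batchmean")
```

In this sketch the KL term would be added to the classification loss with a small weight; the gate weights sum to one per utterance, so the fused logits are a convex combination of the experts' outputs.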