Emotion Prediction in Conversation (EPC) aims to forecast the emotions of forthcoming utterances by utilizing preceding dialogue. Previous EPC approaches relied on simple context modeling for emotion extraction, overlooking fine-grained emotion cues at the word level. Moreover, prior works failed to account for the intrinsic differences between modalities, resulting in redundant information. To overcome these limitations, we propose an emotional cues extraction and fusion network consisting of two stages: a modality-specific learning stage that utilizes word-level labels and prosody learning to construct an emotion embedding space for each modality, and a two-step fusion stage that integrates the multi-modal features. The emotion features extracted by our model are also applicable to the Emotion Recognition in Conversation (ERC) task. Experimental results validate the efficacy of the proposed method, demonstrating superior performance on both the IEMOCAP and MELD datasets.
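The two-stage pipeline described above can be sketched in code. The snippet below is a minimal, hypothetical illustration only: the encoder projections, the gating-based "two-step" fusion, and all dimensions are assumptions for exposition, not the paper's actual architecture, and the word-level supervision and prosody learning of the first stage are not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(features, proj):
    """Stand-in for the modality-specific learning stage: project raw
    modality features into that modality's emotion embedding space.
    (The paper's word-level labels and prosody learning are omitted.)"""
    return np.tanh(features @ proj)

def two_step_fusion(text_emb, audio_emb):
    """Hypothetical two-step fusion of the per-modality embeddings:
    step 1 mixes the modalities with a learned-style gate,
    step 2 concatenates the mix with a cross-modal interaction term."""
    gate = 1.0 / (1.0 + np.exp(-(text_emb + audio_emb)))      # step 1: gate
    mixed = gate * text_emb + (1.0 - gate) * audio_emb
    return np.concatenate([mixed, text_emb * audio_emb], axis=-1)  # step 2

# Toy setup: 4 utterances, 16-dim raw features per modality, 8-dim embeddings.
W_text = rng.normal(size=(16, 8))
W_audio = rng.normal(size=(16, 8))
text_raw = rng.normal(size=(4, 16))
audio_raw = rng.normal(size=(4, 16))

fused = two_step_fusion(encode(text_raw, W_text), encode(audio_raw, W_audio))
print(fused.shape)  # fused multi-modal representation, one row per utterance
```

In this sketch the fused representation could then feed a downstream classifier for either EPC (predicting the next utterance's emotion) or ERC (recognizing the current utterance's emotion), matching the abstract's claim that the extracted features serve both tasks.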