In affective computing, the task of Emotion Recognition in Conversations (ERC) has emerged as a focal area of research. Its primary objective is to predict emotional states within conversations by analyzing multimodal data, including text, audio, and video. While existing studies have made progress in extracting and fusing representations from multimodal data, they often overlook the temporal dynamics of the data over the course of a conversation. To address this challenge, we develop the SpikEmo framework, which is based on spiking neurons and employs a Semantic & Dynamic Two-stage Modeling approach to more precisely capture the complex temporal features of multimodal emotional data. Additionally, to tackle the class imbalance and emotional semantic similarity problems in ERC tasks, we devise an innovative combination of loss functions that significantly enhances the model's performance on ERC data characterized by long-tail distributions. Extensive experiments on multiple ERC benchmark datasets demonstrate that SpikEmo significantly outperforms existing state-of-the-art methods. Our code is available at https://github.com/Yu-xm/SpikEmo.git.
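The abstract does not specify SpikEmo's neuron model, so as a generic, hedged illustration of why spiking neurons suit temporal modeling, the sketch below simulates a standard leaky integrate-and-fire (LIF) neuron: its membrane potential integrates inputs over time steps and emits a spike only when a threshold is crossed, making the output inherently time-dependent. The function name, parameters, and constants here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def lif_forward(inputs, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate a leaky integrate-and-fire (LIF) neuron over T time steps.

    inputs: array of shape (T,) giving the input current at each step.
    Returns a binary spike train of shape (T,).
    (Illustrative only -- not SpikEmo's actual neuron model.)
    """
    v = v_reset
    spikes = np.zeros_like(inputs, dtype=float)
    for t, x in enumerate(inputs):
        # Leaky integration: the potential decays toward v_reset while the
        # input current drives it upward, with time constant tau.
        v = v + (x - (v - v_reset)) / tau
        if v >= v_threshold:   # Fire once the threshold is crossed...
            spikes[t] = 1.0
            v = v_reset        # ...then hard-reset the membrane potential.
    return spikes

# A sustained sub-threshold input never fires; a strong input fires periodically,
# so identical per-step inputs can yield different outputs depending on history.
weak = lif_forward(np.full(10, 0.3))    # converges below threshold: no spikes
strong = lif_forward(np.full(10, 1.5))  # charges over two steps, then fires
```

Because the spike output depends on accumulated membrane state rather than the current input alone, a sequence of such neurons naturally encodes the kind of temporal dynamics in conversational data that the abstract argues prior fusion-based methods overlook.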