Emotion recognition and sentiment analysis are pivotal tasks in speech and language processing, particularly in real-world scenarios involving multi-party conversational data. This paper presents a multimodal approach to both tasks on a well-known conversational dataset. We propose a system that integrates four key modalities: pre-trained RoBERTa for text, pre-trained Wav2Vec2 for speech, a proposed FacialNet for facial expressions, and a CNN+Transformer architecture trained from scratch for video. Feature embeddings from each modality are concatenated into a single multimodal vector, which is then used to predict emotion and sentiment labels. The multimodal system outperforms each unimodal approach, achieving an accuracy of 66.36% for emotion recognition and 72.15% for sentiment analysis.
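As a rough illustration of the concatenation-based fusion described above, the sketch below shows how per-modality embeddings might be joined into one multimodal vector and fed to a classification head. The framework (PyTorch), the module name `FusionClassifier`, the embedding dimensions, and the 7-way emotion label space are assumptions for illustration only, not the authors' implementation.

```python
# Hypothetical sketch of late fusion by concatenation (assumed PyTorch).
# Embedding dimensions, layer sizes, and class count are illustrative,
# not taken from the paper.
import torch
import torch.nn as nn

class FusionClassifier(nn.Module):
    def __init__(self, text_dim=768, audio_dim=768, face_dim=512,
                 video_dim=512, num_classes=7):
        super().__init__()
        fused_dim = text_dim + audio_dim + face_dim + video_dim
        # Simple MLP head over the concatenated multimodal vector.
        self.head = nn.Sequential(
            nn.Linear(fused_dim, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes),
        )

    def forward(self, text_emb, audio_emb, face_emb, video_emb):
        # Concatenate per-modality embeddings along the feature axis.
        fused = torch.cat([text_emb, audio_emb, face_emb, video_emb], dim=-1)
        return self.head(fused)

# Usage with dummy embeddings for a batch of 4 utterances.
model = FusionClassifier()
logits = model(torch.randn(4, 768), torch.randn(4, 768),
               torch.randn(4, 512), torch.randn(4, 512))
print(logits.shape)  # torch.Size([4, 7])
```

In practice, each embedding would come from the corresponding frozen or fine-tuned encoder (RoBERTa, Wav2Vec2, the facial model, and the video model); the same fused vector can feed separate heads for emotion and sentiment.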