Multimodal sentiment analysis (MSA) is a research field that recognizes human sentiments by combining textual, visual, and audio modalities. The main challenge lies in integrating sentiment-related information across modalities, which arises both during unimodal feature extraction and during multimodal feature fusion. Existing methods extract only shallow information from unimodal features during the extraction phase, neglecting differences in sentiment expression across personalities. During the fusion phase, they directly merge the features of each modality without accounting for feature-level differences, which ultimately degrades recognition performance. To address these problems, we propose a personality-sentiment aligned multi-level fusion framework. We introduce personality traits during feature extraction and propose a novel personality-sentiment alignment method that, for the first time, obtains personalized sentiment embeddings from the textual modality. In the fusion phase, we introduce a novel multi-level fusion method that gradually integrates sentiment information from the textual, visual, and audio modalities through multimodal pre-fusion and a multi-level enhanced fusion strategy. Experiments on two widely used datasets show that our method achieves state-of-the-art results.
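The following is a minimal, hypothetical sketch of the gradual fusion idea sketched in the abstract, not the authors' implementation: text features are repeatedly enhanced with visual and audio cues over several levels via cross-attention. All names (PreFusion, MultiLevelFusion, d_model, n_levels) and design details are assumptions for illustration; the personality-sentiment alignment step is not shown.

```python
# Hypothetical sketch of multi-level text-visual-audio fusion; not the paper's code.
import torch
import torch.nn as nn

class PreFusion(nn.Module):
    """Fuse another modality into the text stream via cross-attention (assumed design)."""
    def __init__(self, d_model: int, n_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, text, other):
        # Text queries attend to the other modality; residual keeps textual information.
        fused, _ = self.attn(query=text, key=other, value=other)
        return self.norm(text + fused)

class MultiLevelFusion(nn.Module):
    """Gradually integrate textual, visual, and audio features over several levels."""
    def __init__(self, d_model: int, n_levels: int = 3):
        super().__init__()
        self.visual_levels = nn.ModuleList(PreFusion(d_model) for _ in range(n_levels))
        self.audio_levels = nn.ModuleList(PreFusion(d_model) for _ in range(n_levels))
        self.head = nn.Linear(d_model, 1)  # sentiment score regression head

    def forward(self, text, visual, audio):
        h = text
        for fuse_v, fuse_a in zip(self.visual_levels, self.audio_levels):
            h = fuse_v(h, visual)  # enhance with visual cues
            h = fuse_a(h, audio)   # then with audio cues
        return self.head(h.mean(dim=1))  # pool over time and predict sentiment

# Usage with random features: batch of 2, sequence length 20, feature dimension 64.
model = MultiLevelFusion(d_model=64)
t, v, a = (torch.randn(2, 20, 64) for _ in range(3))
print(model(t, v, a).shape)  # torch.Size([2, 1])
```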