Multimodal Sentiment Analysis (MSA) is a key research frontier that seeks to understand human emotions by combining text, audio, and visual data. Discerning subtle emotional cues in audio and video remains a formidable challenge, however, particularly when the emotional polarities of different segments appear similar. In this paper, we aim to highlight the emotion-relevant attributes of the audio and visual modalities to facilitate multimodal fusion under nuanced emotional shifts in visual-audio scenarios. To this end, we introduce DEVA, a progressive fusion framework built on textual sentiment descriptions that accentuate the emotional features of visual-audio content. DEVA employs an Emotional Description Generator (EDG) to convert raw audio and visual data into textualized sentiment descriptions, amplifying their emotional characteristics. These descriptions are then integrated with the source data to yield richer, enhanced features. DEVA further incorporates a Text-guided Progressive Fusion Module (TPF) that leverages text at multiple levels as a core-modality guide. This module progressively fuses the visual-audio minor modalities to alleviate the disparity between the text and visual-audio modalities. Experimental results on the widely used sentiment analysis benchmarks MOSI, MOSEI, and CH-SIMS show significant improvements over state-of-the-art models. Fine-grained emotion experiments further confirm DEVA's strong sensitivity to subtle emotional variations.
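To make the described pipeline concrete, below is a minimal PyTorch sketch of the two components the abstract names. It is an illustration under stated assumptions, not the authors' implementation: `EmotionalDescriptionGenerator` and `TextGuidedProgressiveFusion` are hypothetical stand-ins for EDG and TPF, all dimensions are illustrative, and the paper's EDG presumably produces actual text descriptions rather than the embedding-space projection used here.

```python
import torch
import torch.nn as nn


class EmotionalDescriptionGenerator(nn.Module):
    """Hypothetical stand-in for the paper's EDG: projects raw modality
    features into the text embedding space to mimic textualized sentiment
    descriptions, then integrates them with the projected source features."""

    def __init__(self, in_dim: int, text_dim: int):
        super().__init__()
        self.describe = nn.Sequential(
            nn.Linear(in_dim, text_dim), nn.GELU(), nn.Linear(text_dim, text_dim)
        )
        self.source = nn.Linear(in_dim, text_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, in_dim); sum pseudo-description and source features.
        return self.describe(x) + self.source(x)


class TextGuidedProgressiveFusion(nn.Module):
    """Hypothetical stand-in for TPF: text serves as the query at each
    stage, progressively attending to the enhanced audio, then visual,
    minor-modality streams."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.audio_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.visual_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, text, audio, visual):
        # Stage 1: text queries the description-enhanced audio features.
        h, _ = self.audio_attn(text, audio, audio)
        text = self.norm_a(text + h)
        # Stage 2: the updated text representation queries the visual features.
        h, _ = self.visual_attn(text, visual, visual)
        return self.norm_v(text + h)


# Toy usage with random features; all dimensions are illustrative.
B, T, D = 2, 8, 128
text = torch.randn(B, T, D)
audio = EmotionalDescriptionGenerator(74, D)(torch.randn(B, T, 74))
visual = EmotionalDescriptionGenerator(35, D)(torch.randn(B, T, 35))
fused = TextGuidedProgressiveFusion(D)(text, audio, visual)
print(fused.shape)  # torch.Size([2, 8, 128])
```

The two-stage, text-as-query design mirrors the abstract's claim that text acts as the core-modality guide while the minor modalities are fused progressively; the actual stage count, attention form, and integration scheme in DEVA may differ.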