Sentiment analysis is advancing rapidly by exploiting multiple data modalities (e.g., text and images). However, most previous work relies on superficial information and neglects contextual world knowledge (e.g., background information derived from, but going beyond, the given image-text pairs), which limits the performance of multimodal sentiment analysis (MSA). In this paper, we propose WisdoM, a plug-in framework that leverages contextual world knowledge induced from large vision-language models (LVLMs) to enhance MSA. WisdoM uses LVLMs to comprehensively analyze both images and their corresponding texts while simultaneously generating pertinent context. To reduce noise in this context, we also introduce a training-free contextual fusion mechanism. Experiments across MSA tasks of diverse granularity consistently demonstrate substantial improvements over several state-of-the-art methods, with an average gain of +1.96% F1 across five advanced methods.