Sentiment analysis is advancing rapidly by exploiting multiple data modalities (e.g., text and images). However, most previous work relies on superficial information and neglects contextual world knowledge (e.g., background information derived from, but going beyond, the given image-text pair), which limits multimodal sentiment analysis performance. In this paper, we propose WisdoM, a plug-in framework that leverages contextual world knowledge induced from large vision-language models (LVLMs) to enhance multimodal sentiment analysis. WisdoM uses an LVLM to analyze both the image and its corresponding sentence and to generate pertinent context. To reduce noise in this context, we also introduce a training-free Contextual Fusion mechanism. Experimental results on multimodal sentiment analysis tasks of diverse granularities consistently show that our approach yields substantial improvements over several state-of-the-art methods, with an average gain of +1.89 F1 across five advanced methods. Code will be released.