Compared with traditional sentiment analysis, which considers only text, multimodal sentiment analysis must weigh emotional signals from multiple modalities simultaneously and is therefore closer to the way humans process sentiment in real-world scenarios. It involves processing emotional information from diverse sources such as natural language, images, videos, audio, and physiological signals. Although other modalities also carry varied emotional cues, natural language usually provides richer contextual information and thus occupies a central position in multimodal sentiment analysis. The emergence of ChatGPT has opened up immense potential for applying large language models (LLMs) to text-centric multimodal tasks. However, it remains unclear how existing LLMs can best adapt to text-centric multimodal sentiment analysis. This survey aims to (1) present a comprehensive review of recent research on text-centric multimodal sentiment analysis tasks, (2) examine the potential of LLMs for text-centric multimodal sentiment analysis, outlining their approaches, advantages, and limitations, (3) summarize the application scenarios of LLM-based multimodal sentiment analysis technology, and (4) explore the challenges and potential research directions for multimodal sentiment analysis in the future.