Stance detection, which aims to identify public opinion towards specific targets using social media data, is an important yet challenging task. With the proliferation of diverse multimodal social media content, including text and images, multimodal stance detection (MSD) has become a crucial research area. However, existing MSD studies have focused on modeling stance within individual text-image pairs, overlooking the multi-party conversational contexts that naturally occur on social media. This limitation stems from a lack of datasets that authentically capture such conversational scenarios, hindering progress in conversational MSD. To address this, we introduce a new multimodal multi-turn conversational stance detection dataset (called MmMtCSD). To derive stances from this challenging dataset, we propose a novel multimodal large language model stance detection framework (MLLM-SD) that learns joint stance representations from textual and visual modalities. Experiments on MmMtCSD show state-of-the-art performance of our proposed MLLM-SD approach for multimodal stance detection. We believe that MmMtCSD will contribute to advancing real-world applications of stance detection research.