The rapid proliferation of multimodal social media content has driven research in Multimodal Conversational Stance Detection (MCSD), which aims to interpret users' attitudes toward specific targets within complex discussions. However, existing studies remain limited by two issues: **1) pseudo-multimodality**, where visual cues appear only in source posts while comments are treated as text-only, which does not reflect real-world multimodal interactions; and **2) user homogeneity**, where diverse users are treated uniformly, neglecting the personal traits that shape stance expression. To address these issues, we introduce **U-MStance**, the first user-centric MCSD dataset, containing over 40k annotated comments across six real-world targets. We further propose **PRISM**, a **P**ersona-**R**easoned mult**I**modal **S**tance **M**odel for MCSD. PRISM first derives longitudinal user personas from historical posts and comments to capture individual traits, then aligns textual and visual cues within the conversational context via Chain-of-Thought reasoning to bridge semantic and pragmatic gaps across modalities. Finally, a mutual task reinforcement mechanism jointly optimizes stance detection and stance-aware response generation, enabling bidirectional knowledge transfer between the two tasks. Experiments on U-MStance demonstrate that PRISM yields significant gains over strong baselines, underscoring the effectiveness of user-centric and context-grounded multimodal reasoning for realistic stance understanding.
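The joint optimization of stance detection and stance-aware response generation can be pictured as a generic multi-task objective. The sketch below is only an illustration under stated assumptions: the abstract does not specify the form of the mutual task reinforcement mechanism, so the weighted sum, the function names, and the weight `alpha` are all hypothetical and are not taken from PRISM.

```python
import math
from typing import Sequence


def stance_cross_entropy(stance_probs: Sequence[float], gold_stance: int) -> float:
    """Negative log-likelihood of the gold stance label under predicted class probabilities."""
    return -math.log(stance_probs[gold_stance])


def response_nll(token_probs: Sequence[float]) -> float:
    """Mean negative log-likelihood of the reference response tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def joint_loss(
    stance_probs: Sequence[float],
    gold_stance: int,
    token_probs: Sequence[float],
    alpha: float = 0.5,  # hypothetical task weight, not a value from the paper
) -> float:
    """Weighted sum of the two task losses (an assumed, generic multi-task objective)."""
    detection = stance_cross_entropy(stance_probs, gold_stance)
    generation = response_nll(token_probs)
    return alpha * detection + (1.0 - alpha) * generation


# Toy example: three stance classes, a two-token reference response.
loss = joint_loss([0.5, 0.25, 0.25], gold_stance=0, token_probs=[0.5, 0.5])
```

With a single shared model, minimizing such a combined loss lets gradients from each task shape the representation used by the other, which is one simple way to obtain transfer in both directions.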