Multi-modal stance detection (MSD) aims to determine an author's stance toward a given target from both textual and visual content. While recent methods leverage multi-modal fusion and prompt-based learning, most fail to distinguish modality-specific signals from cross-modal evidence, which limits performance. We propose DiME (Disentangled Multi-modal Experts), a novel architecture that explicitly separates stance information into textual-dominant, visual-dominant, and cross-modal shared components. DiME first applies a target-aware Chain-of-Thought prompt to generate reasoning-guided textual input. Dual encoders then extract modality features, which are processed by three expert modules trained with specialized loss functions: contrastive learning for the modality-specific experts and cosine alignment for the shared representation. A gating network adaptively fuses the expert outputs for the final prediction. Experiments on four benchmark datasets show that DiME consistently outperforms strong unimodal and multi-modal baselines in both in-target and zero-shot settings.
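For concreteness, the following PyTorch sketch illustrates the three-expert-plus-gating layout described above. It is a minimal reconstruction from the abstract alone: the module names, hidden dimension, and the exact pairing scheme for the contrastive loss (here, in-batch InfoNCE between each expert output and its own encoder feature) are assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def info_nce(a, b, tau=0.07):
    # In-batch InfoNCE: each matched pair (a_i, b_i) is the positive,
    # all other rows in the batch act as negatives.
    a, b = F.normalize(a, dim=-1), F.normalize(b, dim=-1)
    logits = a @ b.t() / tau
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

class DiMESketch(nn.Module):
    # Hypothetical reconstruction of DiME's expert/gating design;
    # all names and sizes are placeholders, not the released model.
    def __init__(self, dim=768, num_classes=3):
        super().__init__()
        self.text_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.visual_expert = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.shared_text = nn.Linear(dim, dim)    # projects text into the shared space
        self.shared_visual = nn.Linear(dim, dim)  # projects image into the shared space
        self.gate = nn.Linear(3 * dim, 3)         # gating network over the three experts
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, h_text, h_img):
        # h_text / h_img: pooled (B, dim) features from the dual encoders;
        # the target-aware CoT prompt is applied upstream, before text encoding.
        e_t = self.text_expert(h_text)            # textual-dominant expert
        e_v = self.visual_expert(h_img)           # visual-dominant expert
        s_t, s_v = self.shared_text(h_text), self.shared_visual(h_img)
        e_s = 0.5 * (s_t + s_v)                   # cross-modal shared representation
        # Cosine alignment pulls the two shared projections together.
        align_loss = 1.0 - F.cosine_similarity(s_t, s_v, dim=-1).mean()
        # Contrastive terms tie each modality-specific expert to its own
        # encoder feature (pairing scheme assumed; the abstract only names
        # the loss family).
        contrast_loss = info_nce(e_t, h_text) + info_nce(e_v, h_img)
        # Gating network adaptively weights and fuses the three expert outputs.
        w = torch.softmax(self.gate(torch.cat([e_t, e_v, e_s], dim=-1)), dim=-1)
        fused = w[:, 0:1] * e_t + w[:, 1:2] * e_v + w[:, 2:3] * e_s
        return self.classifier(fused), align_loss, contrast_loss

# Example: logits, align, contrast = DiMESketch()(torch.randn(4, 768), torch.randn(4, 768))

The softmax gate realizes the "adaptively fuses expert outputs" step; how the alignment and contrastive terms are weighted against the classification loss is not specified in the abstract and is left open here.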