Even when factually correct, social-media news previews (image-headline pairs) can induce interpretation drift: by selectively omitting crucial context, they lead readers to form judgments that diverge from what the full article conveys. This covert harm is harder to detect than explicit misinformation yet remains underexplored. To address this gap, we develop a multi-stage pipeline that disentangles and simulates preview-based versus context-based understanding, enabling construction of the MM-Misleading benchmark. Using this benchmark, we systematically evaluate open-source LVLMs and uncover pronounced blind spots in detecting omission-based misleadingness. We further propose OMGuard, which integrates (1) Interpretation-Aware Fine-Tuning, which improves multimodal misleadingness detection, and (2) Rationale-Guided Misleading Content Correction, which uses explicit rationales to guide headline rewriting and reduce misleading impressions. Experiments show that OMGuard lifts an 8B model's detection accuracy to match a 235B LVLM and delivers markedly stronger end-to-end correction. Further analysis reveals that misleadingness typically stems from local narrative shifts (e.g., missing background) rather than global frame changes, and identifies image-driven scenarios where text-only correction fails, highlighting the necessity of visual interventions.
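To make the detect-then-correct design concrete, the following is a minimal sketch of how OMGuard's two stages could be wired together. Everything here is an illustrative assumption: the function names, prompts, and the text-only stand-in for the image are not the paper's actual fine-tuning setup or API, which the abstract does not specify.

```python
# Minimal illustrative sketch of a two-stage detect-then-correct flow,
# assuming a generic callable `lvlm` that maps a prompt string to a reply.
# All names and prompts are hypothetical; the image is represented by a
# caption here, whereas a real LVLM would consume the image itself.

from dataclasses import dataclass


@dataclass
class Preview:
    headline: str
    image_caption: str  # textual stand-in for the preview image
    article: str        # full article providing the omitted context


def detect_misleadingness(lvlm, p: Preview) -> tuple[bool, str]:
    """Stage 1: ask the (interpretation-aware fine-tuned) model whether the
    preview omits context in a way that shifts the reader's judgment away
    from the full article, and request an explicit rationale."""
    prompt = (
        f"Headline: {p.headline}\nImage: {p.image_caption}\n"
        f"Article: {p.article}\n"
        "Does this preview omit crucial context and mislead readers? "
        "Answer YES or NO on the first line, then a one-sentence rationale."
    )
    reply = lvlm(prompt)
    verdict, _, rationale = reply.partition("\n")
    return verdict.strip().upper().startswith("YES"), rationale.strip()


def correct_headline(lvlm, p: Preview, rationale: str) -> str:
    """Stage 2: rationale-guided rewriting, where the explicit rationale
    steers the rewrite toward restoring the omitted context."""
    prompt = (
        f"The headline '{p.headline}' is misleading because: {rationale}\n"
        f"Article: {p.article}\n"
        "Rewrite the headline so it no longer gives this misleading impression."
    )
    return lvlm(prompt).strip()


def omguard(lvlm, p: Preview) -> str:
    """End-to-end: only rewrite when the detector flags the preview."""
    misleading, rationale = detect_misleadingness(lvlm, p)
    return correct_headline(lvlm, p, rationale) if misleading else p.headline
```

The key design point this sketch reflects is that the rationale produced in stage 1 is passed verbatim into the stage-2 prompt, so the rewrite targets the specific omission rather than rewriting blindly.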