Vision-Language Models (VLMs) have shown strong multimodal reasoning capabilities on Visual Question Answering (VQA) benchmarks. However, their robustness against textual misinformation remains under-explored. While existing research has studied the effects of misinformation in text-only domains, it remains unclear how VLMs arbitrate between contradictory information from different modalities. To bridge this gap, we first propose the CONTEXT-VQA (i.e., Conflicting Text) dataset, consisting of image-question pairs together with systematically generated persuasive prompts that deliberately conflict with the visual evidence. We then design and execute a thorough evaluation framework to benchmark the susceptibility of various models to these conflicting multimodal inputs. Comprehensive experiments on 11 state-of-the-art VLMs reveal that these models are indeed vulnerable to misleading textual prompts, often overriding clear visual evidence in favor of the conflicting text, and exhibit an average performance drop of over 48.2% after only one round of persuasive conversation. Our findings highlight a critical limitation of current VLMs and underscore the need for improved robustness against textual manipulation.