The various limitations of Generative AI, such as hallucinations and model failures, have made it crucial to understand the role of different modalities in Visual Language Model (VLM) predictions. Our work investigates how the integration of information from the image and text modalities influences the performance and behavior of VLMs in visual question answering (VQA) and reasoning tasks. We measure this effect through answer accuracy, reasoning quality, model uncertainty, and modality relevance. We study the interplay between the text and image modalities in different configurations where visual content is essential for solving the VQA task. Our contributions include (1) the Semantic Interventions (SI)-VQA dataset, (2) a benchmark study of various VLM architectures under different modality configurations, and (3) the Interactive Semantic Interventions (ISI) tool. The SI-VQA dataset serves as the foundation for the benchmark, while the ISI tool provides an interface for testing and applying semantic interventions to image and text inputs, enabling more fine-grained analysis. Our results show that complementary information between modalities improves answer and reasoning quality, while contradictory information harms model performance and confidence. Text annotations embedded in images have minimal impact on accuracy and uncertainty, though they slightly increase image relevance. Attention analysis confirms the dominant role of image inputs over text in VQA tasks. In this study, we evaluate state-of-the-art VLMs from which attention coefficients can be extracted for each modality. A key finding is PaliGemma's harmful overconfidence, which poses a higher risk of silent failures compared to the LLaVA models. This work lays the foundation for rigorous analysis of modality integration, supported by datasets specifically designed for this purpose.
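The modality relevance measured via attention coefficients can be illustrated with a minimal sketch. Assuming access to a cross-attention matrix over the input sequence and a mask marking which input positions are image tokens, one plausible relevance score is each modality's share of the total attention mass. The function name, shapes, and masking scheme here are illustrative assumptions, not the API of any specific VLM or of the SI-VQA benchmark.

```python
# Hypothetical sketch: compute per-modality attention relevance.
# `attn` is a list of rows (one per generated token), each row holding
# non-negative attention weights over the input tokens; `image_mask`
# marks which input positions correspond to image tokens. This is an
# assumed data layout, not a specific model's output format.

def modality_relevance(attn, image_mask):
    """Return (image_share, text_share) of total attention mass."""
    image_mass = 0.0
    text_mass = 0.0
    for row in attn:
        for weight, is_image in zip(row, image_mask):
            if is_image:
                image_mass += weight
            else:
                text_mass += weight
    total = image_mass + text_mass
    if total == 0:
        return 0.0, 0.0
    return image_mass / total, text_mass / total

# Toy example: two generated tokens attending over four input tokens,
# the first three of which are image tokens.
attn = [
    [0.4, 0.3, 0.2, 0.1],
    [0.5, 0.2, 0.2, 0.1],
]
image_mask = [True, True, True, False]
img_share, txt_share = modality_relevance(attn, image_mask)
```

A score like this, aggregated over layers and heads, is one way to quantify the dominance of image inputs over text reported in the abstract; which layers and heads to aggregate is a design choice left open here.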