Vision and language models (VLMs) are currently the most generally performant architectures on multimodal tasks. Besides their predictions, they can also produce explanations, either in post-hoc or chain-of-thought (CoT) settings. However, it is not clear to what extent they use the vision and text modalities when generating predictions or explanations. In this work, we investigate whether VLMs rely on the modalities differently when generating explanations as opposed to providing answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings, by extending existing tests and measures to these models. We find that VLMs are less self-consistent than LLMs. Across all measured tasks, the text contributions in VL decoders are much larger than the image contributions, and the image contributions are significantly larger for explanation generation than for answer generation. This difference is even more pronounced in the CoT setting than in the post-hoc explanation setting. We also provide an up-to-date benchmarking of state-of-the-art VL decoders on the VALSE benchmark, which to date has focused only on VL encoders. We find that VL decoders still struggle with most phenomena tested by VALSE.