Vision and language model (VLM) decoders are currently the best-performing architectures on multimodal tasks. Besides predictions, they can also produce explanations, either in post-hoc or chain-of-thought (CoT) settings. However, it is not clear how much they use the vision and text modalities when generating predictions or explanations. In this work, we investigate whether VLMs rely on modalities differently when they produce explanations as opposed to providing answers. We also evaluate the self-consistency of VLM decoders in both post-hoc and CoT explanation settings by extending existing unimodal tests and measures to VLM decoders. We find that VLMs are less self-consistent than large language models (LLMs). Text contributions in VL decoders are more important than image contributions in all examined tasks. Moreover, the contributions of images are significantly stronger for explanation generation than for answer generation. This difference is even larger in CoT than in post-hoc explanations. Lastly, we provide an up-to-date benchmarking of state-of-the-art VL decoders on the VALSE benchmark, which previously covered only VL encoders. We find that VL decoders still struggle with most phenomena tested by VALSE.