Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.
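
The central claim above, that an explanation map can move while the prediction stays fixed, can be illustrated with a deliberately simplified toy. The sketch below is not X-Shift and does not use CLIP. It assumes a hypothetical model in which the image embedding is the mean of the patch features, the prediction is the class embedding with the largest dot product, and the heatmap is each patch's similarity to the predicted class embedding. All function names (`predict`, `heatmap`, `shift_explanation`) and the random data are illustrative assumptions. Under these assumptions, a perturbation that sums to zero across patches leaves the pooled embedding unchanged, so the prediction is preserved while the heatmap peak moves to a chosen patch.

```python
import numpy as np

def predict(patches, class_text):
    """Toy classifier (assumption): mean-pool patch features, pick the best-matching class."""
    image_embedding = patches.mean(axis=0)
    return int(np.argmax(class_text @ image_embedding))

def heatmap(patches, text_embedding):
    """Toy explanation (assumption): per-patch similarity to the class text embedding."""
    return patches @ text_embedding

def shift_explanation(patches, text_embedding, target, margin=1.0):
    """Add a zero-sum perturbation along the text direction.

    The target patch is pushed toward the text embedding and every other
    patch is pushed slightly away, so the perturbations cancel in the mean.
    """
    n = patches.shape[0]
    scores = heatmap(patches, text_embedding)
    norm_sq = float(text_embedding @ text_embedding)
    # Step size large enough for the target patch to exceed the current peak.
    eps = (scores.max() - scores[target]) / norm_sq + margin
    delta = np.tile(-eps / (n - 1) * text_embedding, (n, 1))
    delta[target] = eps * text_embedding
    return patches + delta

rng = np.random.default_rng(0)
patches = rng.normal(size=(16, 8))     # 16 patches, 8-dim features (synthetic)
class_text = rng.normal(size=(5, 8))   # 5 class embeddings (synthetic)

label = predict(patches, class_text)
clean_map = heatmap(patches, class_text[label])
target = int(np.argmin(clean_map))     # least relevant patch under the toy explanation

adv_patches = shift_explanation(patches, class_text[label], target)
adv_label = predict(adv_patches, class_text)
adv_map = heatmap(adv_patches, class_text[label])

print("prediction preserved:", adv_label == label)
print("heatmap peak moved to target:", int(np.argmax(adv_map)) == target)
```

The toy omits everything that makes the real setting hard: it perturbs features directly rather than pixels, ignores any imperceptibility budget, and uses a linear explanation. It is meant only to make concrete why prediction stability alone does not imply explanation stability.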
