Interpretability in Vision-Language Models (VLMs) is crucial for trust, debugging, and decision-making in high-stakes applications. We introduce PixelSHAP, a model-agnostic framework extending Shapley-based analysis to structured visual entities. Unlike previous methods that focus on text prompts, PixelSHAP addresses vision-based reasoning by systematically perturbing image objects and quantifying their influence on a VLM's response. PixelSHAP requires no model internals, operating solely on input-output pairs, which makes it compatible with both open-source and commercial models. It supports diverse embedding-based similarity metrics and scales efficiently using optimization techniques inspired by prior Shapley-based methods. We validate PixelSHAP on autonomous-driving scenarios, highlighting its ability to enhance interpretability. Key challenges include segmentation sensitivity and object occlusion. Our open-source implementation facilitates further research.
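To make the attribution procedure concrete, the sketch below illustrates one way object-level Shapley values could be estimated from input-output pairs alone, in the spirit described above. It is a minimal illustration, not the paper's implementation: the callables `query_vlm`, `mask_objects`, and `embed_text` are hypothetical placeholders for a VLM query, an object-removal perturbation, and a response-embedding model, and plain Monte Carlo permutation sampling stands in for the paper's more efficient sampling optimizations.

```python
# Minimal sketch of object-level Shapley attribution for a VLM response.
# Assumptions: `query_vlm`, `mask_objects`, and `embed_text` are user-supplied
# callables (hypothetical placeholders, not part of PixelSHAP's actual API);
# plain Monte Carlo permutation sampling replaces the paper's optimizations.
from typing import Callable, List
import random
import numpy as np


def shapley_object_attributions(
    image: np.ndarray,
    object_masks: List[np.ndarray],  # one boolean mask per segmented object
    prompt: str,
    query_vlm: Callable[[np.ndarray, str], str],   # (image, prompt) -> response text
    mask_objects: Callable[[np.ndarray, List[np.ndarray]], np.ndarray],  # hide objects
    embed_text: Callable[[str], np.ndarray],       # response text -> embedding vector
    n_permutations: int = 50,
    seed: int = 0,
) -> np.ndarray:
    """Estimate per-object Shapley values for their influence on the VLM response.

    The value of a coalition of visible objects is the cosine similarity between
    the embedding of the response on the original image and the embedding of the
    response when all objects outside the coalition are masked out.
    """
    rng = random.Random(seed)
    n = len(object_masks)

    # Reference response on the unperturbed image.
    baseline = embed_text(query_vlm(image, prompt))

    def coalition_value(visible: set) -> float:
        hidden = [object_masks[i] for i in range(n) if i not in visible]
        response = query_vlm(mask_objects(image, hidden), prompt)
        emb = embed_text(response)
        return float(
            np.dot(baseline, emb)
            / (np.linalg.norm(baseline) * np.linalg.norm(emb) + 1e-12)
        )

    values = np.zeros(n)
    for _ in range(n_permutations):
        order = list(range(n))
        rng.shuffle(order)
        visible, prev = set(), coalition_value(set())
        for obj in order:  # accumulate each object's marginal contribution
            visible.add(obj)
            curr = coalition_value(visible)
            values[obj] += curr - prev
            prev = curr
    return values / n_permutations
```

In this framing, higher values indicate objects whose removal shifts the VLM's response embedding the most, which is one way to realize the model-agnostic, input-output-only attribution the abstract describes.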