Mechanistic interpretability seeks to understand the neural mechanisms that enable specific behaviors in Large Language Models (LLMs) by leveraging causality-based methods. While these approaches have identified neural circuits that copy spans of text, capture factual knowledge, and more, they remain unusable for multimodal models, since adapting these tools to the vision-language domain requires considerable architectural changes. In this work, we adapt a unimodal causal tracing tool to BLIP to enable the study of the neural mechanisms underlying image-conditioned text generation. We demonstrate our approach on a visual question answering dataset, highlighting the causal relevance of later-layer representations for all tokens. Furthermore, we release our BLIP causal tracing tool as open source to enable further experimentation in vision-language mechanistic interpretability by the community. Our code is available at https://github.com/vedantpalit/Towards-Vision-Language-Mechanistic-Interpretability.