Despite the growing use of transformer models in computer vision, a mechanistic understanding of these networks is still lacking. This work introduces a method to reverse-engineer Vision Transformers trained to solve image classification tasks. Inspired by previous research in NLP, we demonstrate how the inner representations at any level of the hierarchy can be projected onto the learned class embedding space to uncover how these networks build categorical representations for their predictions. We use our framework to show how image tokens develop class-specific representations that depend on attention mechanisms and contextual information, and offer insights into how self-attention and MLP layers differentially contribute to this categorical composition. We additionally demonstrate that this method (1) can be used to determine the parts of an image that are important for detecting the class of interest, and (2) exhibits significant advantages over traditional linear probing approaches. Taken together, our results position our proposed framework as a powerful tool for mechanistic interpretability and explainability research.
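To make the central idea concrete, the following is a minimal sketch of projecting intermediate token representations onto a class embedding space. It is not the authors' implementation. It assumes that the class embedding matrix is the weight matrix of the final classification head, that a parameter-free layer normalization is applied before projecting, and that random arrays can stand in for a real model's activations. All names (`project_to_class_space`, `class_rank`, `hidden_states`, `class_embed`) and sizes are hypothetical.

```python
import numpy as np


def layer_norm(x, eps=1e-6):
    """Normalize the last axis to zero mean and unit variance (no learned scale or shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def project_to_class_space(hidden, class_embed, bias=None):
    """Map hidden states of shape (..., d_model) to class scores of shape (..., n_classes)."""
    scores = layer_norm(hidden) @ class_embed.T
    if bias is not None:
        scores = scores + bias
    return scores


def class_rank(scores, target):
    """Rank of the target class per token; 0 means the target has the highest score."""
    target_score = scores[..., target:target + 1]
    return (scores > target_score).sum(axis=-1)


# Assumption: random stand-ins for per-layer token activations and head weights.
rng = np.random.default_rng(0)
n_layers, n_tokens, d_model, n_classes = 4, 10, 16, 5
hidden_states = rng.normal(size=(n_layers, n_tokens, d_model))
class_embed = rng.normal(size=(n_classes, d_model))

# One score vector per layer and token, then the rank of a chosen class in each.
scores = project_to_class_space(hidden_states, class_embed)
ranks = class_rank(scores, target=2)
```

Under these assumptions, `ranks[l, t]` indicates how strongly token `t` at layer `l` already encodes the chosen class. Reshaping the target-class scores of the image-patch tokens into the patch grid would give a coarse map of the image regions most associated with that class.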