Adapting foundation models for specific purposes has become a standard approach for building machine learning systems for downstream applications. Yet it remains an open question which mechanisms take place during adaptation. Here we develop a new Sparse Autoencoder (SAE) for the CLIP vision transformer, named PatchSAE, to extract interpretable concepts at granular levels (e.g., the shape, color, or semantics of an object) together with their patch-wise spatial attributions. We explore how these concepts influence the model output in downstream image classification tasks and investigate how recent state-of-the-art prompt-based adaptation techniques change the association of model inputs to these concepts. While concept activations differ slightly between adapted and non-adapted models, we find that the majority of gains on common adaptation tasks can be explained by concepts already present in the non-adapted foundation model. This work offers a concrete framework for training and using SAEs for Vision Transformers and provides insights into the mechanisms of adaptation.
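To make the setup concrete, the following is a minimal sketch of a PatchSAE-style sparse autoencoder applied to CLIP ViT token activations. It is not the authors' released implementation: the choice of a ReLU encoder with an L1 sparsity penalty, and all names and hyperparameters (`PatchSAESketch`, `d_model`, `n_latents`, `l1_coeff`) are illustrative assumptions.

```python
# Minimal sketch (assumption: a standard ReLU SAE with an L1 sparsity penalty,
# trained per token on CLIP ViT activations; names and sizes are illustrative).
import torch
import torch.nn as nn
import torch.nn.functional as F

class PatchSAESketch(nn.Module):
    def __init__(self, d_model: int = 768, n_latents: int = 16384):
        super().__init__()
        # Overcomplete dictionary: n_latents >> d_model, so each latent
        # can specialize to an interpretable concept.
        self.enc = nn.Linear(d_model, n_latents)
        self.dec = nn.Linear(n_latents, d_model)
        self.b_pre = nn.Parameter(torch.zeros(d_model))  # pre-encoder bias

    def forward(self, x: torch.Tensor):
        # x: [batch, n_tokens, d_model] -- CLS + patch tokens from one ViT layer.
        z = F.relu(self.enc(x - self.b_pre))   # sparse latent (concept) activations
        x_hat = self.dec(z) + self.b_pre       # reconstruction of the activations
        return x_hat, z

def sae_loss(x, x_hat, z, l1_coeff: float = 3e-4):
    # Reconstruction fidelity plus a sparsity objective on the latents.
    return F.mse_loss(x_hat, x) + l1_coeff * z.abs().sum(dim=-1).mean()

# Usage: hook a CLIP ViT layer, collect its token activations, train the SAE.
sae = PatchSAESketch()
tokens = torch.randn(4, 197, 768)  # e.g. ViT-B/16 at 224px: 1 CLS + 196 patches
x_hat, z = sae(tokens)
loss = sae_loss(tokens, x_hat, z)
loss.backward()
```

Because the SAE is applied to every token rather than only the CLS token, the latent activations `z` carry one value per image patch, which is what enables the patch-wise spatial attribution of concepts described above.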