Large-scale pre-trained vision foundation models, such as CLIP, have become de facto backbones for various vision tasks. However, due to their black-box nature, understanding the rules underlying these models' predictions and controlling their behavior remain open challenges. We present a framework for interpreting a vision transformer's latent tokens with natural language. Given a latent token, our framework preserves its semantic information through to the final layer using the transformer's local operations and retrieves the closest text as an explanation. Our approach enables understanding of the model's visual reasoning procedure without additional model training or data collection. Based on the obtained interpretations, our framework supports model editing that controls the model's reasoning behavior and improves its robustness against biases and spurious correlations.
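The retrieval step described above can be illustrated with a minimal sketch. The abstract does not specify which operations count as "local", so the code below assumes one possible reading: each remaining layer is applied to the single token alone, with attention reduced to its value and output projections (no mixing across tokens) followed by the MLP. All names (`local_forward`, `closest_texts`, the weight keys `W_v`, `W_o`, `W_1`, `W_2`, and `proj`) are hypothetical, the weights are random stand-ins rather than CLIP parameters, and ReLU replaces whatever activation the real model uses.

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    # Parameter-free layer normalization over the last axis.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def local_forward(token, layers):
    # Carry one latent token through the remaining layers using only
    # per-token operations (assumption: value/output projections + MLP).
    x = token
    for layer in layers:
        # Attention branch without cross-token mixing.
        x = x + layer_norm(x) @ layer["W_v"] @ layer["W_o"]
        # Feed-forward branch (ReLU used here for simplicity).
        h = np.maximum(layer_norm(x) @ layer["W_1"], 0.0)
        x = x + h @ layer["W_2"]
    return x


def closest_texts(token, layers, proj, text_embeds, texts, k=1):
    # Project the forwarded token into the joint image-text space and
    # rank candidate texts by cosine similarity.
    z = layer_norm(local_forward(token, layers)) @ proj
    z = z / np.linalg.norm(z)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)
    sims = t @ z
    order = np.argsort(-sims)[:k]
    return [(texts[i], float(sims[i])) for i in order]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d, e = 16, 8  # hypothetical hidden and joint-space widths
    layers = [
        {
            "W_v": rng.normal(size=(d, d)) * 0.1,
            "W_o": rng.normal(size=(d, d)) * 0.1,
            "W_1": rng.normal(size=(d, 4 * d)) * 0.1,
            "W_2": rng.normal(size=(4 * d, d)) * 0.1,
        }
        for _ in range(3)
    ]
    proj = rng.normal(size=(d, e))
    texts = ["a dog", "a cat", "grass", "water"]
    text_embeds = rng.normal(size=(len(texts), e))
    token = rng.normal(size=d)
    print(closest_texts(token, layers, proj, text_embeds, texts, k=2))
```

In a real setting the layer weights, the projection, and the text embeddings would come from the pre-trained model and its text encoder. The sketch only shows the shape of the computation: no training and no extra data, just a forward pass restricted to one token and a nearest-neighbor lookup.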