Transformers have revolutionized Computer Vision (CV) through self-attention mechanisms. However, their complexity makes latent token representations difficult to interpret. We introduce ULTra, a framework for interpreting Transformer embeddings and uncovering meaningful semantic patterns within them. ULTra enables unsupervised semantic segmentation with pre-trained models, without requiring fine-tuning. Additionally, we propose a self-supervised training approach that refines segmentation performance by learning an external transformation matrix, leaving the underlying model unmodified. Our method achieves state-of-the-art performance in unsupervised semantic segmentation, outperforming existing approaches. Furthermore, we validate ULTra for model interpretation in both synthetic and real-world scenarios, including Object Selection and interpretable text summarization with LLMs, demonstrating its broad applicability in explaining the semantic structure of latent token representations.
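To make the refinement step concrete, the following is a minimal PyTorch sketch of the general idea: a single learnable transformation matrix applied to token embeddings from a frozen pre-trained backbone, with only the matrix being optimized. All names (`ExternalTransform`, `num_concepts`) and dimensions here are illustrative assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

class ExternalTransform(nn.Module):
    """Learnable matrix mapping token embeddings to concept scores.

    Illustrative sketch only; not the ULTra implementation.
    """
    def __init__(self, embed_dim: int, num_concepts: int):
        super().__init__()
        # W maps each D-dimensional token embedding to concept scores.
        self.W = nn.Parameter(torch.randn(embed_dim, num_concepts) * 0.02)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, embed_dim) from a frozen backbone.
        return tokens @ self.W  # (batch, num_tokens, num_concepts)

# Stand-in for a frozen pre-trained Transformer encoder; in practice
# this would be e.g. a ViT whose parameters are never updated.
backbone = nn.Identity()
for p in backbone.parameters():
    p.requires_grad_(False)

transform = ExternalTransform(embed_dim=768, num_concepts=16)
optimizer = torch.optim.Adam(transform.parameters(), lr=1e-3)

x = torch.randn(2, 196, 768)      # e.g. ViT patch-token embeddings
scores = transform(backbone(x))   # per-token concept scores
labels = scores.argmax(dim=-1)    # crude per-token segmentation map
```

The key design point reflected here is that the pre-trained model stays frozen: only the external matrix `W` receives gradients, so interpretability is added on top of the backbone rather than by altering it.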