Vision Transformers (ViTs) have achieved state-of-the-art performance in image classification, yet their attention mechanisms often remain opaque and exhibit dense, unstructured behaviors. In this work, we adapt our previously proposed SVD-Inspired Attention (SVDA) mechanism to the ViT architecture, introducing a geometrically grounded formulation that enhances interpretability, sparsity, and spectral structure. We employ the interpretability indicators originally proposed alongside SVDA to monitor attention dynamics during training and to assess structural properties of the learned representations. Experimental evaluations on four widely used benchmarks -- CIFAR-10, FashionMNIST, CIFAR-100, and ImageNet-100 -- demonstrate that SVDA consistently yields more interpretable attention patterns without sacrificing classification accuracy. While the current framework offers descriptive insights rather than prescriptive guidance, our results establish SVDA as a comprehensive and informative tool for analyzing and developing structured attention models in computer vision. This work lays the foundation for future advances in explainable AI, spectral diagnostics, and attention-based model compression.