Vision Transformers (ViTs) have become a dominant architecture in computer vision, yet producing stable, high-resolution attribution maps for these models remains challenging. Architectural components such as patch embeddings and attention routing often introduce structured artifacts into pixel-level explanations, which is why many existing methods fall back on coarse patch-level attributions. We introduce DAVE \textit{(\underline{D}istribution-aware \underline{A}ttribution via \underline{V}iT Gradient D\underline{E}composition)}, a mathematically grounded attribution method for ViTs based on a structured decomposition of the input gradient. By exploiting architectural properties of ViTs, DAVE isolates the locally equivariant, stable components of the effective input--output mapping, separating them from architecture-induced artifacts and other sources of instability.
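As a schematic sketch of the kind of decomposition the abstract describes (the projector $P$ below is a hypothetical placeholder introduced for exposition; DAVE's actual operators are defined in the method section), one can picture splitting the input gradient of the model output $f$ at an image $x$ as
\[
\nabla_x f(x) \;=\; \underbrace{P\,\nabla_x f(x)}_{\text{locally equivariant, stable component}} \;+\; \underbrace{(I - P)\,\nabla_x f(x)}_{\text{architecture-induced artifacts}},
\]
with the attribution map then constructed from the stable component alone.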