The vision transformer (ViT) extends the success of transformer models from sequential data to images. The model decomposes an image into many smaller patches and arranges them into a sequence. Multi-head self-attention is then applied to the sequence to learn the attention between patches. Despite many successful interpretations of transformers on sequential data, little effort has been devoted to the interpretation of ViTs, and many questions remain unanswered. For example, among the numerous attention heads, which ones are more important? How strongly do individual patches attend to their spatial neighbors in different heads? What attention patterns have individual heads learned? In this work, we answer these questions through a visual analytics approach. First, we identify which heads are more important in ViTs by introducing multiple pruning-based metrics. Second, we profile the spatial distribution of attention strengths between patches inside individual heads, as well as the trend of attention strengths across attention layers. Third, using an autoencoder-based learning solution, we summarize all possible attention patterns that individual heads could learn. By examining the attention strengths and patterns of the important heads, we explain why they are important. Through concrete case studies with experienced deep learning experts on multiple ViTs, we validate the effectiveness of our solution, which deepens the understanding of ViTs in terms of head importance, head attention strength, and head attention pattern.
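To make the patch-sequence and per-head attention described in the abstract concrete, the following is a minimal NumPy sketch, not the implementation studied in this work. It assumes random, untrained weights, and it omits the class token, position embeddings, and value projection that a real ViT would include. The function names `image_to_patches` and `multi_head_attention_weights`, and all sizes, are hypothetical choices made for illustration.

```python
import numpy as np


def image_to_patches(image, patch_size):
    """Split an (H, W, C) image into a sequence of flattened patches."""
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    rows, cols = h // patch_size, w // patch_size
    # Group pixels by patch row and patch column, then flatten each patch.
    patches = image.reshape(rows, patch_size, cols, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(rows * cols, patch_size * patch_size * c)


def multi_head_attention_weights(tokens, w_q, w_k, num_heads):
    """Return softmax attention weights with shape (num_heads, N, N)."""
    n, d = tokens.shape
    head_dim = d // num_heads
    # Project tokens, then split the embedding dimension across heads.
    q = (tokens @ w_q).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    k = (tokens @ w_k).reshape(n, num_heads, head_dim).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)
    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


# Hypothetical sizes: a 32x32 RGB image, 8x8 patches, 4 heads.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 3))
patches = image_to_patches(image, patch_size=8)

embed_dim, num_heads = 64, 4
w_embed = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
w_q = rng.normal(scale=0.02, size=(embed_dim, embed_dim))
w_k = rng.normal(scale=0.02, size=(embed_dim, embed_dim))

tokens = patches @ w_embed
attn = multi_head_attention_weights(tokens, w_q, w_k, num_heads)
print(patches.shape, attn.shape)
```

Each slice `attn[h]` is one head's patch-to-patch attention matrix, with row `i` holding how strongly patch `i` attends to every other patch. Matrices of this kind are the raw material for the head-level strength and pattern analyses the abstract describes.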