The Vision Transformer (ViT) is widely used in computer vision tasks, owing to its self-attention mechanism. However, its architecture is complex and difficult to comprehend, which creates a steep learning curve, and ViT developers and users frequently struggle to interpret its inner workings. A visualization system is therefore needed to help ViT users understand how the model functions. This paper introduces EL-VIT, an interactive visual analytics system designed to probe the Vision Transformer and facilitate a better understanding of its operations. The system consists of four layers of visualization views. The first three layers are the model overview, the knowledge background graph, and the model detail view. They explain how ViT operates from three perspectives (overall model architecture, detailed explanation, and mathematical operations), enabling users to understand the underlying principles and the transitions between layers. The fourth layer, the interpretation view, helps ViT users and experts gain a deeper understanding by calculating the cosine similarity between patches. Two usage scenarios demonstrate the effectiveness and usability of EL-VIT in helping ViT users understand the working mechanism of ViT.
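
The cosine similarity between patches mentioned above can be illustrated with a minimal sketch. The abstract does not say how EL-VIT computes or displays this quantity, so everything below is an assumption made for illustration: the helper names `cosine_similarity` and `patch_similarity_matrix` are hypothetical, the three-dimensional vectors are toy values rather than real ViT patch embeddings, and returning 0.0 for a zero vector is a convention chosen here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: treat similarity with a zero vector as 0.0.
        return 0.0
    return dot / (norm_u * norm_v)


def patch_similarity_matrix(patch_embeddings):
    """Pairwise cosine similarity for a list of patch embedding vectors."""
    return [
        [cosine_similarity(p, q) for q in patch_embeddings]
        for p in patch_embeddings
    ]


# Toy embeddings for three hypothetical patches (not data from the paper).
patches = [
    [1.0, 0.0, 0.0],  # patch 0
    [2.0, 0.0, 0.0],  # patch 1: same direction as patch 0, larger norm
    [0.0, 3.0, 0.0],  # patch 2: orthogonal to patches 0 and 1
]

sim = patch_similarity_matrix(patches)
for row in sim:
    print(["%.2f" % value for value in row])
```

In this toy case, patches 0 and 1 point in the same direction and so are maximally similar despite their different magnitudes, while patch 2 is orthogonal to both. Cosine similarity ignores vector length, which is one plausible reason to prefer it over a raw dot product when comparing patch representations.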