We introduce a new attention mechanism, dubbed structural self-attention (StructSA), that leverages rich correlation patterns naturally emerging in key-query interactions of attention. StructSA generates attention maps by recognizing space-time structures of key-query correlations via convolution and uses them to dynamically aggregate local contexts of value features. This effectively leverages rich structural patterns in images and videos such as scene layouts, object motion, and inter-object relations. Using StructSA as a main building block, we develop the structural vision transformer (StructViT) and evaluate its effectiveness on both image and video classification tasks, achieving state-of-the-art results on ImageNet-1K, Kinetics-400, Something-Something V1 & V2, Diving-48, and FineGym.
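The mechanism described above can be illustrated with a toy 1-D sketch. This is a hypothetical simplification, not the paper's implementation: `struct_attention`, the smoothing `kernel`, and the moving-average value context are all illustrative assumptions. It shows the two key ideas in miniature: the query-key correlation map is convolved so that each attention logit reflects the local correlation *pattern* around it (rather than a single dot product), and attention then aggregates locally pooled value contexts instead of individual value vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def struct_attention(q, k, v, kernel, ctx=3):
    """Toy 1-D sketch (hypothetical simplification of StructSA):
    convolve the raw query-key correlation map so each attention logit
    sees the local correlation structure, then attend over locally
    averaged value contexts."""
    d = q.shape[-1]
    corr = q @ k.T / np.sqrt(d)  # (Nq, Nk) raw correlation map
    # recognize local correlation structure along the key axis via convolution
    struct = np.stack([np.convolve(row, kernel, mode="same") for row in corr])
    attn = softmax(struct, axis=-1)
    # local context of values: moving average over a small window
    pad = ctx // 2
    vp = np.pad(v, ((pad, pad), (0, 0)), mode="edge")
    v_ctx = np.stack([vp[i:i + ctx].mean(axis=0) for i in range(v.shape[0])])
    return attn @ v_ctx  # (Nq, d) structure-aware aggregation

N, d = 8, 4
q, k, v = (rng.standard_normal((N, d)) for _ in range(3))
out = struct_attention(q, k, v, kernel=np.array([0.25, 0.5, 0.25]))
print(out.shape)
```

In the actual method the convolution operates over space-time neighborhoods of the correlation volume, which is what lets it pick up layouts, motion, and inter-object relations; the 1-D kernel here only conveys the idea of scoring correlation patterns rather than pointwise correlations.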