Visual Dependency Transformers: Dependency Tree Emerges from Reversed Attention

Humans possess a versatile mechanism for extracting structured representations of the visual world. When looking at an image, we can decompose the scene into entities and their parts and recover the dependencies between them. To mimic this capability, we propose Visual Dependency Transformers (DependencyViT), which induce visual dependencies without any labels. We achieve this with a novel neural operator called reversed attention, which naturally captures long-range visual dependencies between image patches. Specifically, we formulate it as a dependency graph in which a child token is trained to attend to its parent tokens and send information to them following a normalized probability distribution, rather than gathering information as in conventional self-attention. With this design, hierarchies naturally emerge from the reversed attention layers, and a dependency tree is progressively induced from the leaf nodes to the root node without supervision. DependencyViT offers several appealing benefits. (i) Entities and their parts in an image are represented by different subtrees, enabling part partitioning from dependencies. (ii) Dynamic visual pooling becomes possible: leaf nodes that rarely send messages can be pruned without hindering model performance, and on this basis we propose the lightweight DependencyViT-Lite to reduce the computational and memory footprints. (iii) DependencyViT works well under both self- and weakly-supervised pretraining paradigms on ImageNet, and demonstrates its effectiveness on 8 datasets and 5 tasks, such as unsupervised part and saliency segmentation, recognition, and detection.
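
The contrast between sending and gathering can be made concrete with a small sketch. This is not the DependencyViT implementation. It assumes single-head attention on plain Python lists, that each child's distribution over parents is a scaled dot-product softmax, and that a token's parent can be read out as its highest-probability target. The names `send_distribution`, `reversed_attention`, and `induce_parents` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def send_distribution(q, k):
    """Row i is child token i's normalized distribution over candidate parents.

    Assumption: scaled dot-product scores, as in standard attention.
    """
    d = len(q[0])
    scores = [
        [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        for qi in q
    ]
    return [softmax(r) for r in scores]


def reversed_attention(q, k, v):
    """Each child i SENDS its value to parents j with weight p[i][j].

    Conventional self-attention would compute out[i] = sum_j p[i][j] * v[j]
    (token i gathers). Here the same matrix is applied transposed:
    out[j] = sum_i p[i][j] * v[i] (token j receives what children send).
    """
    p = send_distribution(q, k)
    n, dv = len(v), len(v[0])
    out = [[0.0] * dv for _ in range(n)]
    for i in range(n):          # child index
        for j in range(n):      # parent index
            for c in range(dv):
                out[j][c] += p[i][j] * v[i][c]
    return p, out


def induce_parents(p):
    """Hypothetical read-out: a token's parent is its most probable target."""
    return [max(range(len(row)), key=row.__getitem__) for row in p]


# Toy example with three tokens and two feature dimensions (made-up values).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

p, out = reversed_attention(q, k, v)
parents = induce_parents(p)
```

Because every child's outgoing weights sum to one, the total value mass is conserved: summing `out` over all tokens gives the same vector as summing `v`. Row-normalized gathering does not have this property in general, which is one way to see why the two operators behave differently.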
