Convolutional networks and vision transformers have different forms of pairwise interactions, pooling across layers and pooling at the end of the network. Does the latter really need to be different? As a by-product of pooling, vision transformers provide spatial attention for free, but this attention is most often of low quality unless the model is self-supervised, a phenomenon that is not well studied. Is supervision really the problem? In this work, we develop a generic pooling framework and then formulate a number of existing methods as instantiations of it. By discussing the properties of each group of methods, we derive SimPool, a simple attention-based pooling mechanism that replaces the default one in both convolutional and transformer encoders. We find that, whether supervised or self-supervised, this improves performance on pre-training and downstream tasks and provides attention maps delineating object boundaries in all cases. One could thus call SimPool universal. To our knowledge, we are the first to obtain attention maps in supervised transformers of at least as good quality as in self-supervised ones, without explicit losses or modifying the architecture. Code at: https://github.com/billpsomas/simpool.
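To make the idea of attention-based pooling at the end of the network concrete, the following is a minimal NumPy sketch. It is not the SimPool implementation (see the repository above for that); it only illustrates the general pattern of pooling a set of spatial features with a single attention vector, which also yields a spatial attention map as a by-product. The function name `attention_pool`, the projection matrices `w_q` and `w_k`, the use of the global average as the query, and the scaled dot-product similarity are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(features, w_q, w_k):
    """Pool (p, d) spatial features into one (d,) vector.

    Hypothetical illustration: the query is a projection of the global
    average of the features, the keys are projections of each location,
    and the values are the features themselves.
    """
    d = features.shape[1]
    query = features.mean(axis=0) @ w_q        # (d,) one query for the image
    keys = features @ w_k                      # (p, d) one key per location
    attn = softmax(keys @ query / np.sqrt(d))  # (p,) weights over locations
    pooled = attn @ features                   # (d,) weighted average
    return pooled, attn

# Toy input: a 7x7 feature map with 16 channels, flattened to (49, 16).
rng = np.random.default_rng(0)
h, w, d = 7, 7, 16
features = rng.standard_normal((h * w, d))
w_q = rng.standard_normal((d, d)) / np.sqrt(d)
w_k = rng.standard_normal((d, d)) / np.sqrt(d)

pooled, attn = attention_pool(features, w_q, w_k)
attn_map = attn.reshape(h, w)  # spatial attention map obtained "for free"
```

In this sketch, setting `w_k` to zero makes the attention uniform, so the pooled vector reduces to global average pooling; this is one way to see average pooling and attention-based pooling as instances of the same weighted-average form.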