Convolutions (Convs) and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones. Although some works try to integrate both, they apply the two operators simultaneously at the finest pixel granularity. Given that Convs are already responsible for per-pixel feature extraction, the question is whether we still need to include the heavy MHSAs at such a fine-grained level. In fact, this is the root cause of the scalability issue w.r.t. the input resolution for vision transformers. To address this important problem, we propose in this work to use MHSAs and Convs in parallel \textbf{at different granularity levels} instead. Specifically, in each layer, we use two different ways to represent an image: a fine-grained regular grid and a coarse-grained set of semantic slots. We apply different operations to these two representations: Convs to the grid for local features, and MHSAs to the slots for global features. A pair of fully differentiable soft clustering and dispatching modules is introduced to bridge the grid and set representations, thus enabling local-global fusion. Through extensive experiments on various vision tasks, we empirically verify the potential of the proposed integration scheme, named \textit{GLMix}: by offloading the burden of fine-grained features to light-weight Convs, it is sufficient to use MHSAs in a few (e.g., 64) semantic slots to match the performance of recent state-of-the-art backbones, while being more efficient. Our visualization results also demonstrate that the soft clustering module produces a meaningful semantic grouping effect with only IN1k classification supervision, which may induce better interpretability and inspire new weakly-supervised semantic segmentation approaches. Code will be available at \url{https://github.com/rayleizhu/GLMix}.
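To make the bridging idea concrete, the following is a minimal NumPy sketch of what a differentiable soft clustering and dispatching pair could look like. This is an illustrative assumption based on the abstract's description, not the authors' actual implementation: the function names (`soft_cluster`, `dispatch`), the use of a plain softmax over pixel-to-slot similarities, and the normalization scheme are all hypothetical choices.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def soft_cluster(feats, slots, scale=1.0):
    """Softly pool N grid features into M semantic slots.

    feats: (N, C) fine-grained grid features (flattened H*W pixels)
    slots: (M, C) coarse-grained slot features (e.g., M = 64)
    Returns updated slots (M, C) and the soft assignment (N, M).
    (Hypothetical formulation: softmax over slots per pixel,
    then an assignment-weighted average of pixel features.)
    """
    sim = feats @ slots.T * scale                 # (N, M) pixel-to-slot similarity
    assign = softmax(sim, axis=1)                 # each pixel distributed over M slots
    weight = assign.sum(axis=0, keepdims=True).T  # (M, 1) total mass per slot
    new_slots = (assign.T @ feats) / (weight + 1e-6)
    return new_slots, assign

def dispatch(assign, slots):
    """Scatter slot features back onto the grid with the same soft
    assignment, enabling local-global fusion with the Conv branch."""
    return assign @ slots                          # (N, C)
```

In a full layer, MHSA would run on the small slot set (cheap, since M is fixed, e.g., 64) while Convs run on the grid; `dispatch` then adds the globally-mixed slot features back to the per-pixel stream. Because both modules are just matrix products and a softmax, gradients flow through the grouping end to end.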