Vision Transformers have received significant attention due to their impressive performance on many vision tasks. While the token mixer, or attention block, has been studied in great detail, the channel mixer, or feature mixing block (FFN or MLP), has not been explored in depth, although it accounts for the bulk of the parameters and computation in a model. In this work, we study whether sparse feature mixing can replace the dense connections, and confirm that it can with a block diagonal MLP structure that improves accuracy by supporting larger expansion ratios. To improve the feature clusters formed by this structure, and thereby further improve accuracy, a lightweight, parameter-free channel covariance attention (CCA) mechanism is introduced as a parallel branch during training. CCA enables gradual feature mixing across channel groups during training, with a contribution that decays to zero as training progresses to convergence. This allows the CCA block to be discarded at inference, so the accuracy gain comes at no additional computational cost. The resulting $\textit{Scalable CHannEl MixEr}$ (SCHEME) can be plugged into any ViT architecture to obtain a range of models with different trade-offs between complexity and performance by controlling the block size of the block diagonal structure in the MLP. This is shown by the introduction of a new family of SCHEMEformer models. Experiments on image classification, object detection, and semantic segmentation, with different ViT backbones, consistently demonstrate substantial accuracy gains over existing designs, especially in low-FLOP regimes. For example, the SCHEMEformer establishes a new SOTA of 79.7% accuracy for ViTs using pure attention mixers on ImageNet-1K at 1.77G FLOPs.
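
To make the block diagonal MLP idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the channels are split into equal contiguous groups and that each group has its own two-layer MLP with a GELU in between; the names `init_block_diagonal_mlp`, `block_diagonal_mlp`, and `num_weights`, as well as the initialization scale and the omission of biases, are illustrative assumptions.

```python
import numpy as np


def gelu(x):
    # Tanh approximation of GELU.
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def init_block_diagonal_mlp(dim, expansion, groups, seed=0):
    # Hypothetical helper: one (d_g x h_g) and one (h_g x d_g) weight block per group.
    assert dim % groups == 0, "dim must be divisible by groups"
    hidden = dim * expansion
    d_g, h_g = dim // groups, hidden // groups
    rng = np.random.default_rng(seed)
    w1 = rng.standard_normal((groups, d_g, h_g)) * 0.02
    w2 = rng.standard_normal((groups, h_g, d_g)) * 0.02
    return w1, w2


def block_diagonal_mlp(x, w1, w2):
    # x has shape (tokens, dim). Each channel group is mixed only with itself,
    # which is equivalent to multiplying by block diagonal weight matrices.
    groups, d_g, _ = w1.shape
    n = x.shape[0]
    xg = x.reshape(n, groups, d_g)
    h = gelu(np.einsum("ngd,gdh->ngh", xg, w1))
    y = np.einsum("ngh,ghd->ngd", h, w2)
    return y.reshape(n, groups * d_g)


def num_weights(dim, expansion, groups):
    # Weight count of the two layers (biases ignored): 2 * dim * hidden / groups.
    return 2 * dim * (dim * expansion) // groups
```

Under these assumptions, a dense MLP with expansion ratio 4 (`groups=1`) and a two-group block diagonal MLP with expansion ratio 8 have the same weight count, which illustrates how the sparse structure can support larger expansion ratios at a fixed budget. Note that without some additional mechanism, channels in different groups never interact inside this block.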