Strip-MLP: Efficient Token Interaction for Vision MLP

The token interaction operation is one of the core modules in MLP-based models, exchanging and aggregating information between different spatial locations. However, the power of token interaction along the spatial dimension depends heavily on the spatial resolution of the feature maps, which limits the model's expressive ability, especially in deep layers where the features are down-sampled to a small spatial size. To address this issue, we present a novel method called \textbf{Strip-MLP} that enriches the token interaction power in three ways. Firstly, we introduce a new MLP paradigm called the Strip MLP layer, which allows each token to interact with other tokens in a cross-strip manner, enabling the tokens in a row (or column) to contribute to the information aggregation in adjacent but different strips of rows (or columns). Secondly, a \textbf{C}ascade \textbf{G}roup \textbf{S}trip \textbf{M}ixing \textbf{M}odule (CGSMM) is proposed to overcome the performance degradation caused by small spatial feature size. The module allows tokens to interact more effectively in both within-patch and cross-patch manners, independent of the feature spatial size. Finally, based on the Strip MLP layer, we propose a novel \textbf{L}ocal \textbf{S}trip \textbf{M}ixing \textbf{M}odule (LSMM) to boost the token interaction power in the local region. Extensive experiments demonstrate that Strip-MLP significantly improves the performance of MLP-based models on small datasets and obtains comparable or even better results on ImageNet. In particular, Strip-MLP models achieve higher average Top-1 accuracy than existing MLP-based models by +2.44\% on Caltech-101 and +2.16\% on CIFAR-100. The source code will be available at~\href{https://github.com/Med-Process/Strip_MLP}{https://github.com/Med-Process/Strip\_MLP}.
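
To make the dependence on spatial resolution concrete, the following is a minimal sketch of generic row-wise token mixing, assuming a feature map of shape H x W x C and a single shared W x W mixing matrix. It is not the Strip MLP layer, CGSMM, or LSMM from this work; the function names (`mix_rows`, `mixing_params`) and the example widths are hypothetical and chosen only for illustration.

```python
# Hypothetical sketch of row-wise token mixing on an H x W x C feature map.
# This is NOT the authors' Strip MLP layer; names and shapes are assumptions.

def mix_rows(x, weight):
    """Mix tokens along the width of each row.

    x:      nested list of shape [H][W][C]
    weight: nested list of shape [W][W]; weight[i][j] is the contribution
            of input token j to output token i within the same row.
    """
    height, width, channels = len(x), len(x[0]), len(x[0][0])
    out = [[[0.0] * channels for _ in range(width)] for _ in range(height)]
    for h in range(height):
        for i in range(width):
            for j in range(width):
                for c in range(channels):
                    out[h][i][c] += weight[i][j] * x[h][j][c]
    return out


def mixing_params(width):
    """Number of weights in one W x W row-mixing matrix."""
    return width * width


# A 2 x 3 feature map with 1 channel, mixed with an averaging matrix,
# so every output token becomes the mean of its row.
x = [[[1.0], [2.0], [3.0]],
     [[4.0], [5.0], [6.0]]]
avg = [[1.0 / 3] * 3 for _ in range(3)]
y = mix_rows(x, avg)

# Illustrative (assumed) stage widths: the mixing matrix shrinks
# quadratically as the feature map is down-sampled.
param_counts = {w: mixing_params(w) for w in (56, 28, 14, 7)}
```

In this sketch the mixing matrix has W x W entries, so halving the width leaves only a quarter of the token-mixing weights, which is one way to read the resolution dependence described above.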
