Recent Vision Mamba models not only have much lower complexity for processing high-resolution images and long videos, but also achieve performance competitive with Vision Transformers (ViTs). However, they are prone to overfitting and have thus far only been presented at up to base size (about 80M parameters). It remains unclear how vanilla Vision Mamba (Vim) can be efficiently scaled to larger sizes, which is essential for further exploitation. In this paper, we propose a stochastic layer-wise shuffle regularization that enables successfully scaling non-hierarchical Vision Mamba to a large size (about 300M parameters) in a supervised setting. Specifically, our base- and large-scale ShuffleMamba models outperform supervised ViTs of similar size by 0.8\% and 1.0\% classification accuracy on ImageNet-1k, respectively, without auxiliary data. When evaluated on ADE20K semantic segmentation and COCO detection, our ShuffleMamba models also show significant improvements. Without bells and whistles, the stochastic layer-wise shuffle has the following highlights: (1) \textit{Plug and play:} it does not change the model architecture and is omitted at inference. (2) \textit{Simple but effective:} it mitigates overfitting in Vim training while introducing only random token permutation operations. (3) \textit{Intuitive:} the token sequences in deeper layers are more likely to be shuffled, as deeper features are expected to be more semantic and less sensitive to patch positions. Code and models will be available at https://github.com/huangzizheng01/ShuffleMamba.
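As a minimal sketch of the idea described above: shuffle a layer's token sequence at random during training, with a probability that increases with depth, and skip the operation entirely at inference. The linear depth schedule and the function name `layerwise_shuffle` below are illustrative assumptions, not the paper's exact formulation.

```python
import random

def layerwise_shuffle(tokens, layer_idx, num_layers, training=True):
    """Stochastic layer-wise shuffle (sketch).

    With probability p that grows with depth (linear schedule assumed
    here), randomly permute the token sequence before the layer.
    Returns the (possibly shuffled) tokens and the permutation used,
    so the original order can be restored afterwards if needed.
    """
    if not training:
        # Plug and play: the operation is omitted at inference.
        return tokens, None
    # Deeper layers are shuffled more often (assumed linear ramp 0 -> 1).
    p = layer_idx / max(num_layers - 1, 1)
    if random.random() < p:
        perm = list(range(len(tokens)))
        random.shuffle(perm)
        return [tokens[i] for i in perm], perm
    return tokens, None
```

Note that the first layer (`layer_idx == 0`) is never shuffled under this schedule, while the last layer always is; intermediate layers interpolate between the two.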