State Space Models (SSMs) offer linear computational complexity, in contrast to the quadratic attention modules in transformers, and have been applied to vision tasks as a new type of powerful vision foundation model. Inspired by the observation that the final prediction in vision transformers (ViTs) is based on only a subset of the most informative tokens, we take the novel step of enhancing the efficiency of SSM-based vision models through token-based pruning. However, directly applying existing token pruning techniques designed for ViTs fails to deliver good performance, even with extensive fine-tuning. To address this issue, we revisit the unique computational characteristics of SSMs and discover that naive pruning disrupts the sequential positions of tokens. This insight motivates us to design a novel and general token pruning method specifically for SSM-based vision models. We first introduce a pruning-aware hidden state alignment method that stabilizes the neighborhood of the remaining tokens and improves performance. In addition, based on a detailed analysis, we propose a token importance evaluation method adapted to SSMs to guide the pruning. With an efficient implementation and practical acceleration techniques, our method delivers real wall-clock speedup. Extensive experiments demonstrate that our approach achieves significant computation reduction with minimal impact on performance across different tasks. Notably, we achieve 81.7\% accuracy on ImageNet with a 41.6\% reduction in FLOPs for the pruned PlainMamba-L3. Furthermore, our work provides deeper insights into the behavior of SSM-based vision models for future research.
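The core idea of order-preserving token pruning can be illustrated with a minimal sketch (all names here are hypothetical and not the paper's implementation): tokens are scored for importance, the least informative are dropped, and the survivors keep their original sequence order, which matters for SSMs because their recurrence is position-sensitive.

```python
import numpy as np

def prune_tokens(tokens: np.ndarray, scores: np.ndarray, keep_ratio: float) -> np.ndarray:
    """Keep the top-scoring tokens while preserving their original order.

    tokens: (N, D) token embeddings; scores: (N,) importance scores.
    Restoring the sequential order of survivors is the key step for SSMs,
    whose scan-based recurrence depends on token positions.
    """
    n_keep = max(1, int(round(tokens.shape[0] * keep_ratio)))
    top = np.argsort(scores)[-n_keep:]   # indices of the most important tokens
    top_sorted = np.sort(top)            # restore original sequential order
    return tokens[top_sorted]

# Example: 8 tokens of dimension 4, keeping half of them
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
s = rng.random(8)
pruned = prune_tokens(x, s, 0.5)
print(pruned.shape)  # (4, 4)
```

How the importance scores are computed, and how the hidden states are aligned after pruning, are the paper's actual contributions and are not captured by this toy example.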