Mamba and Vision Mamba (Vim) models have shown their potential as an alternative to methods based on the Transformer architecture. This work introduces Fast Mamba for Vision (Famba-V), a cross-layer token fusion technique that enhances the training efficiency of Vim models. The key idea of Famba-V is to identify and fuse similar tokens across different Vim layers based on a suite of cross-layer strategies, instead of applying token fusion uniformly across all layers as existing works propose. We evaluate the performance of Famba-V on CIFAR-100. Our results show that Famba-V enhances the training efficiency of Vim models by reducing both training time and peak memory usage during training. Moreover, the proposed cross-layer strategies allow Famba-V to deliver superior accuracy-efficiency trade-offs. Together, these results demonstrate that Famba-V is a promising efficiency enhancement technique for Vim models.
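To make the token fusion idea concrete, the sketch below shows one common way to fuse similar tokens: compute the similarity of each token to its neighbor, then average the most similar pairs so the sequence shrinks. This is a hypothetical, simplified illustration (the function name, greedy pairing, and adjacent-pair restriction are assumptions for clarity), not the exact algorithm used in Famba-V; in Famba-V such a fusion step would be applied only at the layers selected by a cross-layer strategy.

```python
import math

def fuse_similar_tokens(tokens, r):
    """Hypothetical sketch of similarity-based token fusion: merge the
    r most similar adjacent token pairs by averaging, shrinking the
    token sequence by r. `tokens` is a list of equal-length vectors."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)

    # Similarity of each token to its right neighbor
    sims = [cos(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]

    # Greedily pick the r most similar, non-overlapping pairs
    chosen, used = [], set()
    for i in sorted(range(len(sims)), key=lambda i: -sims[i]):
        if i in used or i + 1 in used:
            continue
        chosen.append(i)
        used.update((i, i + 1))
        if len(chosen) == r:
            break

    # Rebuild the sequence, averaging each chosen pair into one token
    starts = set(chosen)
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            out.append([(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])])
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return out
```

Applying such a step at only a subset of layers, rather than all of them, is what the paper's cross-layer strategies control: fusing too early or too often can discard information the later layers need, while fusing selectively trades a small amount of accuracy for shorter sequences, less compute, and lower peak memory.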