State space models and Mamba-based models have been increasingly applied across various domains, achieving state-of-the-art performance. This technical report presents the first attempt to train a transferable Mamba model with contrastive language-image pretraining (CLIP). We train Mamba models of varying sizes and evaluate them comprehensively on 26 zero-shot classification datasets and 16 out-of-distribution (OOD) datasets. Our findings show that a Mamba model with 67 million parameters is on par with a 307 million-parameter Vision Transformer (ViT) model on zero-shot classification tasks, highlighting the parameter efficiency of Mamba models. In tests of OOD generalization, Mamba-based models perform exceptionally well when image contrast is out of distribution or when images are high-pass filtered. However, a Hessian analysis indicates that Mamba models have a sharper and more non-convex loss landscape than ViT-based models, which makes them more challenging to train. The source code is available at https://github.com/raytrun/mamba-clip.
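For readers unfamiliar with the CLIP objective mentioned above, the following is a minimal NumPy sketch of the standard symmetric contrastive (InfoNCE) loss over a batch of paired image and text embeddings. It is an illustration only, not the implementation in the linked repository: the function name `clip_contrastive_loss`, the fixed `temperature` value of 0.07, and the use of NumPy rather than a deep learning framework are all assumptions made for this example, and the encoders that would produce the embeddings (Mamba or ViT) are left out.

```python
import numpy as np


def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired embeddings.

    image_emb, text_emb: arrays of shape (batch, dim); row i of each
    is assumed to come from the same image-text pair.
    temperature: assumed fixed here; CLIP-style training usually learns it.
    """
    # L2-normalize so that dot products are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix, shape (batch, batch).
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy_diag(scores):
        # Numerically stable row-wise log-softmax; the target for row i is column i.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))
```

The loss is low when each image embedding is most similar to its own caption and high when the pairing is scrambled, which is what allows the same similarity scores to be used for zero-shot classification at test time.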