CLIMP: Contrastive Language-Image Mamba Pretraining

Contrastive Language-Image Pre-training (CLIP) relies on Vision Transformers, whose attention mechanism is susceptible to spurious correlations and scales quadratically with resolution. To address these limitations, we present CLIMP, the first fully Mamba-based contrastive vision-language model, which replaces both the vision and text encoders with Mamba. The new architecture encodes sequential structure in both vision and language. VMamba captures visual spatial inductive biases, which reduces reliance on spurious correlations and produces an embedding space favorable for cross-modal retrieval and out-of-distribution robustness, surpassing OpenAI's CLIP-ViT-B by 7.5% on ImageNet-O. CLIMP naturally supports variable input resolutions without positional encoding interpolation or specialized training, achieving up to 6.6% higher retrieval accuracy at 16x the training resolution while using 5x less memory and 1.8x fewer FLOPs. The autoregressive text encoder further overcomes CLIP's fixed context limitation, enabling dense captioning retrieval. Our findings suggest that Mamba exhibits advantageous properties for vision-language learning, making it a compelling alternative to Transformer-based CLIP.
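For readers unfamiliar with the contrastive objective that CLIP-style models share, the following is a minimal sketch of a symmetric image-text contrastive loss in plain Python. It is an illustration under stated assumptions, not the CLIMP training code: the abstract does not specify the exact loss, the function names are hypothetical, the temperature of 0.07 is an assumed value, and the toy 2-dimensional vectors stand in for the outputs of the vision and text encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy.

    Assumption: pair i in image_embs matches pair i in text_embs, and
    temperature=0.07 is an illustrative value, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j]: cosine similarity of image i and text j, divided by temperature
    logits = [
        [sum(a * b for a, b in zip(im, tx)) / temperature for tx in txts]
        for im in imgs
    ]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

# Toy embeddings (hypothetical stand-ins for encoder outputs).
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]

loss_matched = symmetric_contrastive_loss(matched, matched)
loss_swapped = symmetric_contrastive_loss(matched, swapped)
print(loss_matched < loss_swapped)
```

The loss is low when each image embedding is closest to its own caption and high when pairs are mismatched. Nothing in this objective depends on whether the encoders are Transformers or Mamba blocks, which is what makes swapping both encoders possible in principle.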
