Scaling up contrastive language-image pretraining (CLIP) is critical for empowering both vision and multimodal models. We present EVA-CLIP-18B, the largest and most powerful open-source CLIP model to date, with 18 billion parameters. With only 6 billion training samples seen, EVA-CLIP-18B achieves 80.7% zero-shot top-1 accuracy averaged across 27 widely recognized image classification benchmarks, outperforming its forerunner EVA-CLIP (5 billion parameters) and other open-source CLIP models by a large margin. Remarkably, we observe consistent performance improvement as the model size of EVA-CLIP scales up, even though the training dataset is held constant at 2 billion image-text pairs from LAION-2B and COYO-700M. This dataset is openly available and much smaller than the in-house datasets (e.g., DFN-5B, WebLI-10B) employed in other state-of-the-art CLIP models. EVA-CLIP-18B demonstrates the potential of EVA-style weak-to-strong visual model scaling. With our model weights made publicly available, we hope to facilitate future research in vision and multimodal foundation models.
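To make the headline metric concrete, the following is a minimal sketch of how zero-shot top-1 accuracy can be computed from image and class-text embeddings and then averaged over several benchmarks. It is not the authors' evaluation code: the function names, the toy 2-dimensional embeddings, and the benchmark names `toy_a` and `toy_b` are hypothetical, and the unweighted mean over benchmarks is an assumption about how the 27-benchmark average is formed.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_predict(image_emb, class_text_embs):
    """Return the index of the class whose text embedding is most similar to the image."""
    img = l2_normalize(image_emb)
    sims = [
        sum(a * b for a, b in zip(img, l2_normalize(text_emb)))
        for text_emb in class_text_embs
    ]
    return max(range(len(sims)), key=sims.__getitem__)


def top1_accuracy(image_embs, labels, class_text_embs):
    """Fraction of images whose most similar class matches the ground-truth label."""
    correct = sum(
        zero_shot_predict(emb, class_text_embs) == label
        for emb, label in zip(image_embs, labels)
    )
    return correct / len(labels)


def mean_over_benchmarks(per_benchmark_acc):
    """Unweighted mean of per-benchmark accuracies (assumed averaging scheme)."""
    return sum(per_benchmark_acc.values()) / len(per_benchmark_acc)


# Toy stand-ins for encoder outputs: one text embedding per class name.
class_text_embs = [[1.0, 0.0], [0.0, 1.0]]

# Two hypothetical benchmarks, each a pair of (image embeddings, labels).
toy_a = ([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]], [0, 1, 1])
toy_b = ([[0.7, 0.3], [0.1, 0.9]], [0, 1])

scores = {
    "toy_a": top1_accuracy(toy_a[0], toy_a[1], class_text_embs),
    "toy_b": top1_accuracy(toy_b[0], toy_b[1], class_text_embs),
}
average = mean_over_benchmarks(scores)
```

In a real evaluation the embeddings would come from the model's image and text encoders, with each class name typically wrapped in one or more prompt templates. Those details are omitted here.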