Vision-language foundation models such as CLIP have revolutionized the field of artificial intelligence. Nevertheless, vision-language models (VLMs) that support multiple languages, e.g., both Chinese and English, have lagged behind due to the relative scarcity of large-scale pretraining datasets. To this end, we introduce BM-6B, a comprehensive bilingual (Chinese-English) dataset with over 6 billion image-text pairs, aimed at enabling multimodal foundation models to understand images well in both languages. To handle a dataset of this scale, we propose a novel grouped aggregation approach for image-text contrastive loss computation, which significantly reduces communication overhead and GPU memory demands and yields a 60% increase in training speed. We pretrain a series of bilingual image-text foundation models with enhanced fine-grained understanding ability on BM-6B. The resulting models, dubbed $M^2$-Encoders (pronounced "M-Square"), set new benchmarks in both languages for multimodal retrieval and classification tasks. Notably, our largest model, $M^2$-Encoder-10B, achieves top-1 accuracies of 88.5% on ImageNet and 80.7% on ImageNet-CN under a zero-shot classification setting, surpassing previously reported SoTA methods by 2.2% and 21.1%, respectively. The $M^2$-Encoder series represents one of the most comprehensive sets of bilingual image-text foundation models to date, so we are making it available to the research community for further exploration and development.
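The abstract does not spell out how grouped aggregation works, so the following is only a rough NumPy sketch of the general idea, not the authors' implementation. It assumes a CLIP-style symmetric contrastive loss and a hypothetical variant, `grouped_contrastive_loss`, that draws negatives only from within each group. The function names, the `num_groups` parameter, and the temperature value are all assumptions made for illustration.

```python
import numpy as np


def _log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def contrastive_loss(img, txt, temperature=0.07):
    """Symmetric image-text contrastive loss (CLIP-style) over one batch."""
    img = img / np.linalg.norm(img, axis=1, keepdims=True)
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # shape (N, N)
    idx = np.arange(len(img))
    loss_i2t = -_log_softmax(logits, axis=1)[idx, idx].mean()  # image -> text
    loss_t2i = -_log_softmax(logits, axis=0)[idx, idx].mean()  # text -> image
    return 0.5 * (loss_i2t + loss_t2i)


def grouped_contrastive_loss(img, txt, num_groups, temperature=0.07):
    """Hypothetical grouped variant: negatives come only from the same group.

    Assumes num_groups <= batch size so that no group is empty.
    """
    img_groups = np.array_split(img, num_groups)
    txt_groups = np.array_split(txt, num_groups)
    losses = [
        contrastive_loss(i, t, temperature)
        for i, t in zip(img_groups, txt_groups)
    ]
    sizes = [len(i) for i in img_groups]
    # Weight by group size so the result is a per-sample average.
    return float(np.average(losses, weights=sizes))
```

Under these assumptions, each group builds a similarity matrix of roughly (N/G) × (N/G) entries instead of one N × N matrix, which is one plausible way that memory and communication costs could shrink. With `num_groups=1` the grouped version reduces to the ordinary loss.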