The success of contrastive language-image pretraining (CLIP) relies on supervision from the pairing of images and captions, which tends to be noisy in web-crawled data. We present Mixture of Data Experts (MoDE) and learn a system of CLIP data experts via clustering. Each data expert is trained on one data cluster, which makes it less sensitive to false-negative noise from other clusters. At inference time, we ensemble the experts' outputs using weights determined by the correlation between task metadata and cluster conditions. To estimate this correlation precisely, the samples in one cluster should be semantically similar, but the number of data experts should remain reasonable for training and inference. We therefore consider the ontology in human language and propose to use fine-grained cluster centers to represent each data expert at a coarse-grained level. Experimental studies show that four CLIP data experts on ViT-B/16 outperform the ViT-L/14 models of OpenAI CLIP and OpenCLIP on zero-shot image classification at lower ($<$35\%) training cost. Meanwhile, MoDE can train all data experts asynchronously and can flexibly include new data experts. The code is available at https://github.com/facebookresearch/MetaCLIP/tree/main/mode.
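The inference-time ensembling described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the released implementation: the function names (`expert_weights`, `ensemble_logits`), the softmax temperature, the use of cosine similarity as the "correlation", and the `fine_to_expert` mapping from fine-grained clusters to coarse-grained experts are all hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def expert_weights(task_emb, fine_centers, fine_to_expert, n_experts,
                   temperature=0.1):
    """Assumed routing rule: weight each expert by how similar the task
    metadata is to the fine-grained cluster centers that the expert owns.

    task_emb:       (C, D) embeddings of task metadata, e.g. class names
    fine_centers:   (M, D) fine-grained cluster centers
    fine_to_expert: (M,)   index of the coarse-grained expert per fine cluster
    returns:        (C, K) weights, each row summing to 1
    """
    t = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    c = fine_centers / np.linalg.norm(fine_centers, axis=1, keepdims=True)
    sim = t @ c.T                               # (C, M) cosine similarity
    p = softmax(sim / temperature, axis=1)      # distribution over fine clusters
    w = np.zeros((task_emb.shape[0], n_experts))
    for k in range(n_experts):
        # Aggregate fine-grained probability mass to the coarse-grained expert.
        w[:, k] = p[:, fine_to_expert == k].sum(axis=1)
    return w


def ensemble_logits(expert_logits, weights):
    """Weighted sum of per-expert zero-shot logits.

    expert_logits: (K, N, C) logits from K experts for N images, C classes
    weights:       (C, K)    per-class expert weights
    returns:       (N, C)    ensembled logits
    """
    return np.einsum("knc,ck->nc", expert_logits, weights)
```

Because each row of the weight matrix sums to one, the ensemble reduces to a single expert's logits whenever all experts agree, and otherwise favors the experts whose clusters lie closest to the task metadata.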