Multimodal datasets are a critical component in recent breakthroughs such as Stable Diffusion and GPT-4, yet their design does not receive the same research attention as model architectures or training algorithms. To address this shortcoming in the ML ecosystem, we introduce DataComp, a testbed for dataset experiments centered around a new candidate pool of 12.8 billion image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing the resulting model on 38 downstream test sets. Our benchmark consists of multiple compute scales spanning four orders of magnitude, which enables the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow leads to better training sets. In particular, our best baseline, DataComp-1B, enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet, outperforming OpenAI's CLIP ViT-L/14 by 3.7 percentage points while using the same training procedure and compute. We release DataComp and all accompanying code at www.datacomp.ai.
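As a rough illustration of the filtering step a participant might design, the sketch below keeps the highest-scoring fraction of a candidate pool. It is a minimal example under stated assumptions, not DataComp's actual tooling: the `Pair` record, the `score` field (standing in for some precomputed image-text similarity), and the `filter_top_fraction` helper are all hypothetical names introduced here, and the sample data is made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """Hypothetical record for one image-text pair in a candidate pool."""
    uid: str       # assumed sample identifier
    caption: str   # text side of the pair
    score: float   # assumed precomputed image-text similarity


def filter_top_fraction(pool, fraction):
    """Return the highest-scoring `fraction` of the pool, best first."""
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    # Keep at least one sample so a tiny pool does not filter to nothing.
    keep = max(1, int(len(pool) * fraction))
    ranked = sorted(pool, key=lambda p: p.score, reverse=True)
    return ranked[:keep]


# Toy pool with invented scores, for illustration only.
pool = [
    Pair("a", "a dog on a beach", 0.31),
    Pair("b", "click here", 0.08),
    Pair("c", "red bicycle against a wall", 0.27),
    Pair("d", "IMG_0042.jpg", 0.12),
]

subset = filter_top_fraction(pool, 0.5)
print([p.uid for p in subset])
```

In the workflow described above, a subset chosen this way would then be passed to the standardized CLIP training code and evaluated on the downstream test sets; the fraction kept is the design choice being studied.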