DataComp: In search of the next generation of multimodal datasets

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexander Ratner, Shuran Song, Hannaneh Hajishirzi, Ali Farhadi, Romain Beaumont, Sewoong Oh, Alex Dimakis, Jenia Jitsev, Yair Carmon, Vaishaal Shankar, Ludwig Schmidt

Large multimodal datasets have been instrumental in recent breakthroughs such as CLIP, Stable Diffusion, and GPT-4. At the same time, datasets rarely receive the same research attention as model architectures or training algorithms. To address this shortcoming in the machine learning ecosystem, we introduce DataComp, a benchmark where the training code is fixed and researchers innovate by proposing new training sets. We provide a testbed for dataset experiments centered around a new candidate pool of 12.8B image-text pairs from Common Crawl. Participants in our benchmark design new filtering techniques or curate new data sources and then evaluate their new dataset by running our standardized CLIP training code and testing on 38 downstream test sets. Our benchmark consists of multiple scales, with four candidate pool sizes and associated compute budgets ranging from 12.8M to 12.8B samples seen during training. This multi-scale design facilitates the study of scaling trends and makes the benchmark accessible to researchers with varying resources. Our baseline experiments show that the DataComp workflow is a promising way of improving multimodal datasets. We introduce DataComp-1B, a dataset created by applying a simple filtering algorithm to the 12.8B candidate pool. The resulting 1.4B subset enables training a CLIP ViT-L/14 from scratch to 79.2% zero-shot accuracy on ImageNet. Our new ViT-L/14 model outperforms a larger ViT-g/14 trained on LAION-2B by 0.7 percentage points while requiring 9x less training compute. We also outperform OpenAI's CLIP ViT-L/14, which is trained with the same compute budget as our model, by 3.7 percentage points. These gains highlight the potential for improving model performance by carefully curating training sets. We view DataComp-1B as only the first step and hope that DataComp paves the way toward the next generation of multimodal datasets.

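To make the "filter the candidate pool, then train with fixed code" workflow concrete, here is a minimal sketch of the filtering step alone. The abstract does not specify the filtering algorithm behind DataComp-1B, so everything below is an assumption for illustration: the `Pair` record, its `score` field (a stand-in for any per-sample quality score), and the `filter_pool` helper are hypothetical names, and keeping the top-scoring fraction is just one plausible filtering rule. The only numbers taken from the abstract are the 12.8B pool size and the 1.4B subset size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pair:
    """One image-text pair from a candidate pool (hypothetical record)."""
    url: str
    caption: str
    score: float  # hypothetical per-sample quality score, higher is better


def filter_pool(pool: list[Pair], keep_fraction: float) -> list[Pair]:
    """Keep the top-scoring fraction of the pool.

    This is an illustrative filtering rule, not the rule used to build
    DataComp-1B, which the abstract does not describe.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    n_keep = max(1, round(len(pool) * keep_fraction))
    ranked = sorted(pool, key=lambda p: p.score, reverse=True)
    return ranked[:n_keep]


# Fraction of the 12.8B candidate pool retained in the 1.4B subset,
# using the two sizes stated in the abstract.
DATACOMP_1B_KEEP_FRACTION = 1.4e9 / 12.8e9  # about 0.109

if __name__ == "__main__":
    # Toy pool of 10 pairs with scores 0.0, 0.1, ..., 0.9.
    toy_pool = [Pair(f"https://example.com/{i}.jpg", f"caption {i}", i / 10)
                for i in range(10)]
    subset = filter_pool(toy_pool, keep_fraction=0.3)
    print([p.score for p in subset])
```

In the benchmark as described, the filtered subset would then be passed unchanged to the standardized CLIP training code and scored on the 38 downstream test sets; that part is fixed by the benchmark and is not sketched here.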