Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best-performing dataset, DFN-5B, enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, outperforming models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. To facilitate further research in dataset design, we also release a new 2-billion-example dataset, DFN-2B, and show that high-performance data filtering networks can be trained from scratch using only publicly available data.
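The collect-then-filter paradigm described above can be illustrated with a minimal sketch. The abstract does not say how a filtering network's scores are turned into a training set, so this example assumes one plausible rule: score every example in the pool and keep the highest-scoring fraction. The names `filter_pool`, `score_fn`, and `keep_fraction` are hypothetical, and the scorer here is a stand-in for a trained DFN, not the authors' implementation.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def filter_pool(
    pool: Sequence[T],
    score_fn: Callable[[T], float],
    keep_fraction: float,
) -> List[T]:
    """Keep the highest-scoring fraction of the pool, preserving pool order.

    `score_fn` stands in for a data filtering network (an assumption of this
    sketch): it maps one example to a scalar quality score.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not pool:
        return []

    # Keep at least one example so a tiny pool never filters to nothing.
    n_keep = max(1, int(len(pool) * keep_fraction))
    scores = [score_fn(example) for example in pool]

    # Rank indices by descending score; ties are broken by original position.
    ranked = sorted(range(len(pool)), key=lambda i: (-scores[i], i))

    # Restore the original ordering among the kept examples.
    kept = sorted(ranked[:n_keep])
    return [pool[i] for i in kept]


# Toy pool of (caption, precomputed score) pairs; the scores are made up.
toy_pool = [("a", 0.1), ("b", 0.9), ("c", 0.5), ("d", 0.7)]
toy_subset = filter_pool(toy_pool, score_fn=lambda ex: ex[1], keep_fraction=0.5)
```

Under this assumed rule, the quality of the resulting training set depends only on how `score_fn` ranks examples, which is consistent with the finding that a good filtering network need not be a model with high downstream accuracy.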