Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad hoc, one common paradigm is to first collect a massive pool of data from the Web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best-performing dataset, DFN-5B, enables us to train state-of-the-art models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 83.0% zero-shot transfer accuracy on ImageNet, outperforming models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. To facilitate further research in dataset design, we also release a new 2-billion-example dataset, DFN-2B, and show that high-performance data filtering networks can be trained from scratch using only publicly available data.
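The second step of the paradigm described above, filtering a candidate pool down to a training set with a learned network, can be illustrated with a minimal sketch. The abstract does not specify how a DFN scores examples or how the cutoff is chosen, so everything here is an assumption made for illustration: the filtering network is taken to produce one embedding per image and one per caption, the score is their cosine similarity, and a fixed top fraction of the pool is kept. The names `cosine_similarity`, `filter_pool`, and `keep_fraction` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_similarity(a: Vector, b: Vector) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pool(
    image_embs: Sequence[Vector],
    text_embs: Sequence[Vector],
    keep_fraction: float,
) -> List[int]:
    """Return indices of the highest-scoring image-text pairs.

    Assumption (not stated in the abstract): the filtering network has
    already embedded each image and caption, and a pair's score is the
    cosine similarity of its two embeddings.
    """
    if len(image_embs) != len(text_embs):
        raise ValueError("image and text embedding lists must have equal length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    scores = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]
    if not scores:
        return []

    # Keep at least one example so a tiny pool does not filter to nothing.
    n_keep = max(1, int(len(scores) * keep_fraction))
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    # Return the kept indices in their original pool order.
    return sorted(ranked[:n_keep])


# Toy pool of four pairs with hand-written 2-D "embeddings".
toy_image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
toy_text_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
kept = filter_pool(toy_image_embs, toy_text_embs, keep_fraction=0.5)
print(kept)
```

In this toy pool the first and third pairs have the best image-caption agreement, so they are the ones retained. In a real pipeline the embeddings would come from the trained filtering network, and the kept fraction would be tuned empirically.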