Large training sets have become a cornerstone of machine learning and are the foundation for recent advances in language modeling and multimodal learning. While data curation for pre-training is often still ad hoc, one common paradigm is to first collect a massive pool of data from the web and then filter this candidate pool down to an actual training set via various heuristics. In this work, we study the problem of learning a data filtering network (DFN) for this second step of filtering a large uncurated dataset. Our key finding is that the quality of a network for filtering is distinct from its performance on downstream tasks: for instance, a model that performs well on ImageNet can yield worse training sets than a model with low ImageNet accuracy that is trained on a small amount of high-quality data. Based on our insights, we construct new data filtering networks that induce state-of-the-art image-text datasets. Specifically, our best-performing dataset, DFN-5B, enables us to train state-of-the-art CLIP models for their compute budgets: among other improvements on a variety of tasks, a ViT-H trained on our dataset achieves 84.4% zero-shot transfer accuracy on ImageNet, outperforming models trained on other datasets such as LAION-2B, DataComp-1B, or OpenAI's WIT. To facilitate further research in dataset design, we also release a new 2-billion-example dataset, DFN-2B, and show that high-performance data filtering networks can be trained from scratch using only publicly available data.
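The pool-then-filter paradigm described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a filtering network has already mapped each image and each caption to an embedding vector, that a pair is scored by the cosine similarity of its two embeddings, and that the highest-scoring fraction of the pool is kept. The function names `cosine_similarity` and `filter_pool` and the parameter `keep_fraction` are hypothetical; the abstract does not specify the scoring rule or the threshold.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; returns 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_pool(
    image_embs: Sequence[Sequence[float]],
    text_embs: Sequence[Sequence[float]],
    keep_fraction: float,
) -> List[int]:
    """Return the sorted indices of the image-text pairs that survive filtering.

    Assumption (not from the abstract): each pair is scored by the cosine
    similarity of its embeddings, and the top `keep_fraction` is retained.
    """
    if len(image_embs) != len(text_embs):
        raise ValueError("image_embs and text_embs must have the same length")
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")

    # Score every candidate pair in the uncurated pool.
    scores = [cosine_similarity(i, t) for i, t in zip(image_embs, text_embs)]

    # Keep at least one example so a tiny pool does not filter down to nothing.
    n_keep = max(1, int(len(scores) * keep_fraction))

    # Rank pair indices by score, highest first, and keep the top n_keep.
    ranked = sorted(range(len(scores)), key=lambda idx: scores[idx], reverse=True)
    return sorted(ranked[:n_keep])


# Toy pool of four pairs with 2-dimensional embeddings (made-up values).
toy_image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
toy_text_embs = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
kept = filter_pool(toy_image_embs, toy_text_embs, keep_fraction=0.5)
```

In this sketch the quality of the resulting training set depends entirely on the embeddings the filtering network produces, which is where the key finding applies: a network's downstream accuracy does not by itself indicate how well it ranks the pool.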