Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process shares limitations with supervised learning: for example, crowd-sourced data selection is costly and time-consuming, which prevents scaling the dataset size. In this work, we consider the problem of automatically curating high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all three criteria. Our method applies $k$-means successively and hierarchically to a large and diverse data repository to obtain clusters that are distributed uniformly among data concepts, and then performs hierarchical, balanced sampling from these clusters. Extensive experiments on three data domains (web-based images, satellite images and text) show that features trained on our automatically curated datasets outperform those trained on uncurated data and are on par with or better than those trained on manually curated data. Code is available at https://github.com/facebookresearch/ssl-data-curation.
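To make the two stages named in the abstract concrete, the following is a minimal NumPy sketch of a two-level $k$-means followed by hierarchical, balanced sampling. It is an illustration under stated assumptions, not the implementation in the linked repository. The function names `kmeans` and `hierarchical_balanced_sample`, the parameters `k1`, `k2` and `n_samples`, the farthest-point initialisation, sampling with replacement, and the toy two-blob data are all choices made for this example.

```python
import numpy as np


def kmeans(x, k, n_iter=20, seed=0):
    """Plain Lloyd's k-means; returns (centroids, assignments)."""
    rng = np.random.default_rng(seed)
    # Farthest-point initialisation: an assumption of this sketch, chosen so
    # that well-separated groups each receive at least one initial centroid.
    chosen = [int(rng.integers(len(x)))]
    min_d2 = ((x - x[chosen[0]]) ** 2).sum(axis=1)
    for _ in range(1, k):
        nxt = int(min_d2.argmax())
        chosen.append(nxt)
        min_d2 = np.minimum(min_d2, ((x - x[nxt]) ** 2).sum(axis=1))
    centroids = x[chosen].astype(float)
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = d2.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, d2.argmin(axis=1)


def hierarchical_balanced_sample(x, k1, k2, n_samples, seed=0):
    """Two-level k-means, then uniform top-down sampling (with replacement)."""
    rng = np.random.default_rng(seed)
    c1, a1 = kmeans(x, k1, seed=seed)  # level 1: cluster the data points
    _, a2 = kmeans(c1, k2, seed=seed)  # level 2: cluster the level-1 centroids
    # tree[top] holds one index array per non-empty level-1 cluster under it.
    tree = {}
    for j in range(k1):
        idx = np.flatnonzero(a1 == j)
        if len(idx) > 0:
            tree.setdefault(int(a2[j]), []).append(idx)
    tops = sorted(tree)
    picked = []
    for _ in range(n_samples):
        # Uniform over top-level clusters, then over their children,
        # then over the points of the chosen child.
        children = tree[tops[rng.integers(len(tops))]]
        members = children[rng.integers(len(children))]
        picked.append(int(members[rng.integers(len(members))]))
    return np.array(picked)


# Toy imbalanced pool: a dominant concept (950 points) and a rare one (50).
rng = np.random.default_rng(0)
big = rng.uniform(-1.0, 1.0, size=(950, 2))
small = rng.uniform(-1.0, 1.0, size=(50, 2)) + 10.0
x = np.concatenate([big, small])

picked = hierarchical_balanced_sample(x, k1=8, k2=2, n_samples=1000)
rare_fraction = float((picked >= 950).mean())
print(f"rare concept: 5% of the pool, {rare_fraction:.0%} of the sample")
```

In this toy setting the two groups are far enough apart that the top level separates them, so each is drawn with equal probability and the rare concept should make up roughly half of the sample rather than 5%. This illustrates the balancing idea only; the budget allocation used in the paper may differ from uniform top-down draws with replacement.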