Self-supervised features are the cornerstone of modern machine learning systems. They are typically pre-trained on data collections whose construction and curation require extensive human effort. This manual process shares some limitations with supervised learning: for example, crowd-sourced data selection is costly and time-consuming, which prevents scaling up the dataset size. In this work, we consider the problem of automatically curating high-quality datasets for self-supervised pre-training. We posit that such datasets should be large, diverse and balanced, and propose a clustering-based approach for building datasets that satisfy all three criteria. Our method applies $k$-means successively and hierarchically to a large and diverse data repository to obtain clusters that distribute uniformly among data concepts, then performs a hierarchical, balanced sampling step over these clusters. Extensive experiments on three data domains (web-based images, satellite images and text) show that features trained on our automatically curated datasets outperform those trained on uncurated data and are on par with or better than those trained on manually curated data.
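
To make the two stages concrete, the following is a minimal two-level sketch in pure Python; it is not the authors' implementation. It assumes a plain Lloyd's $k$-means, a hierarchy obtained by clustering the level-1 centroids, and an equal split of the sampling budget across coarse and then fine clusters. The function names (`kmeans`, `hierarchical_balanced_sample`), the parameters (`k1`, `k2`, `n_total`) and the toy 2D data are all hypothetical choices made for illustration.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on a list of tuples; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: mean of the members; an empty cluster keeps its centroid.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, labels


def hierarchical_balanced_sample(points, k1, k2, n_total, seed=0):
    """Two-level k-means followed by top-down, equal-budget sampling (assumed scheme)."""
    rng = random.Random(seed)
    # Level 1: cluster the raw points into k1 fine clusters.
    fine_centroids, fine_labels = kmeans(points, k1, seed=seed)
    # Level 2: cluster the level-1 centroids into k2 coarse clusters.
    _, coarse_labels = kmeans(fine_centroids, k2, seed=seed)
    # Tree: coarse cluster -> fine cluster -> indices of member points.
    tree = {}
    for idx, f in enumerate(fine_labels):
        tree.setdefault(coarse_labels[f], {}).setdefault(f, []).append(idx)
    # Split the budget equally across coarse clusters, then across fine clusters.
    per_coarse = n_total // len(tree)
    picked = []
    for fine_clusters in tree.values():
        per_fine = max(1, per_coarse // len(fine_clusters))
        for members in fine_clusters.values():
            picked += rng.sample(members, min(per_fine, len(members)))
    return picked


# Toy imbalanced repository: a dominant mode (950 points) and a rare mode (50 points).
data_rng = random.Random(1)
points = [(data_rng.gauss(0, 1), data_rng.gauss(0, 1)) for _ in range(950)]
points += [(data_rng.gauss(10, 1), data_rng.gauss(10, 1)) for _ in range(50)]

picked = hierarchical_balanced_sample(points, k1=10, k2=2, n_total=40, seed=0)
rare_share = sum(1 for i in picked if i >= 950) / len(picked)
print(f"sampled {len(picked)} points, rare-mode share = {rare_share:.2f}")
```

In the toy repository the rare mode makes up 5% of the points, so comparing the printed share against that figure indicates how much the sampling step rebalances the data. The result depends on the $k$-means initialisation, and the sketch makes no claim about the behaviour of the full method.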