Data curation is the problem of how to collect and organize samples into a dataset that supports efficient learning. Despite the centrality of the task, little work has been devoted to a large-scale, systematic comparison of curation methods. In this work, we take steps towards a formal evaluation of data curation strategies and introduce SELECT, the first large-scale benchmark of curation strategies for image classification. To generate baseline methods for the SELECT benchmark, we create a new dataset, ImageNet++, which constitutes the largest superset of ImageNet-1K to date. Our dataset extends ImageNet with 5 new training-data shifts, each approximately the size of ImageNet-1K itself, and each assembled using a distinct curation strategy. We evaluate our data curation baselines in two ways: (i) using each training-data shift to train identical image classification models from scratch, and (ii) using the data itself to fit a pretrained self-supervised representation. Our findings reveal trends that bear particularly on recent methods for data curation, such as synthetic data generation and lookup based on CLIP embeddings. We show that although these strategies are highly competitive for certain tasks, the curation strategy used to assemble the original ImageNet-1K dataset remains the gold standard. We anticipate that our benchmark can illuminate the path for new methods to further close this gap. We release our checkpoints, code, documentation, and a link to our dataset at https://github.com/jimmyxu123/SELECT.