We revisit the "dataset classification" experiment proposed by Torralba and Efros a decade ago, in a new era of large-scale, diverse, and hopefully less biased datasets, as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image comes from: for example, we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. Our further experiments show that such a dataset classifier could learn semantic features that are generalizable and transferable, which cannot be explained simply by memorization. We hope our discovery will inspire the community to rethink the issues of dataset bias and model capabilities.