Leaving Reality to Imagination: Robust Classification via Generated Datasets

Recent research on robustness has revealed significant gaps between the performance of neural image classifiers on test sets that resemble their training data and on test sets drawn from a naturally shifted distribution, such as sketches, paintings, and animations of the object categories observed during training. Prior work focuses on reducing this gap by designing engineered augmentations of training data or through unsupervised pretraining of a single large model on massive in-the-wild training datasets scraped from the Internet. However, the notion of a dataset has itself been undergoing a paradigm shift in recent years. With drastic improvements in the quality, ease of use, and accessibility of modern generative models, generated data is pervading the web. In this light, we study the question: How do these generated datasets influence the natural robustness of image classifiers? We find that ImageNet classifiers trained on real data augmented with generated data achieve higher accuracy and effective robustness than standard training and popular augmentation strategies in the presence of natural distribution shifts. We analyze various factors influencing these results, including the choice of conditioning strategies and the amount of generated data. Lastly, we introduce and analyze an evolving generated dataset, ImageNet-G-v1, to better benchmark the design, utility, and critique of standalone generated datasets for robust and trustworthy machine learning. The code and datasets are available at https://github.com/Hritikbansal/generative-robustness.
