This study investigates the robustness of image classifiers to text-guided corruptions. We use diffusion models to edit images into different domains. Unlike prior work that benchmarks on synthetic or hand-picked data, we rely on diffusion models because, as generative models, they can learn to edit images while preserving their semantic content. The resulting corruptions are therefore more realistic, and the comparison is more informative. Moreover, no manual labeling is required, so large-scale benchmarks can be created with less effort. We define a prompt hierarchy based on the original ImageNet hierarchy to apply edits in different domains. In addition to introducing a new benchmark, we investigate the robustness of different vision models. Our results show that the performance of image classifiers decreases significantly under different language-based corruptions and edit domains. We also observe that convolutional models are more robust than transformer architectures. Additionally, common data augmentation techniques improve performance on both the original data and the edited images. These findings can inform the design of image classifiers and contribute to the development of more robust machine learning systems. The code for generating the benchmark will be made available online upon publication.
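As a rough illustration of the evaluation idea described above, the following minimal Python sketch builds edit prompts from a small two-level hierarchy and measures the drop in accuracy between original and edited images. It is not the benchmark's actual code: the superclasses, prompt templates, class labels, and the function names `edit_prompts`, `accuracy`, and `accuracy_drop` are all hypothetical, and the diffusion-based editing step itself is omitted.

```python
from typing import Dict, List, Sequence

# Hypothetical prompt hierarchy: superclass -> edit prompt templates.
# These entries are illustrative only and are not taken from the benchmark.
PROMPT_HIERARCHY: Dict[str, List[str]] = {
    "animal": ["a {label} in the snow", "a pencil sketch of a {label}"],
    "vehicle": ["a {label} at night", "a pencil sketch of a {label}"],
}

# Hypothetical mapping from a class label to its superclass.
SUPERCLASS: Dict[str, str] = {
    "tabby cat": "animal",
    "sports car": "vehicle",
}


def edit_prompts(label: str) -> List[str]:
    """Return the edit prompts for a label, chosen by its superclass."""
    templates = PROMPT_HIERARCHY[SUPERCLASS[label]]
    return [template.format(label=label) for template in templates]


def accuracy(predictions: Sequence[str], labels: Sequence[str]) -> float:
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("at least one example is required")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def accuracy_drop(
    original_predictions: Sequence[str],
    edited_predictions: Sequence[str],
    labels: Sequence[str],
) -> float:
    """Accuracy on original images minus accuracy on edited images.

    The edited images reuse the original labels, on the assumption that
    the edit changes the domain but keeps the object class.
    """
    return accuracy(original_predictions, labels) - accuracy(edited_predictions, labels)
```

In this sketch, a positive `accuracy_drop` means the classifier performs worse on the edited images than on the originals. Computing it per prompt or per superclass would give a breakdown by edit domain.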