Image classifiers should be used with caution in the real world. Performance evaluated on a validation set may not reflect performance in deployment: classifiers may perform well under conditions frequently encountered during training but poorly under other, infrequent conditions. In this study, we hypothesize that recent advances in text-to-image generative models make them valuable for benchmarking computer vision models such as image classifiers: they can generate images conditioned on textual prompts that cause classifier failures, allowing failure conditions to be described with textual attributes. However, generation cost becomes an issue when a large number of synthetic images must be produced, as is the case when many different attribute combinations need to be tested. We propose an image classifier benchmarking method structured as an iterative process that alternates image generation, classifier evaluation, and attribute selection. This method efficiently explores the attribute combinations that ultimately reveal poor classifier behavior.
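The iterative process described above can be sketched as a simple search loop. The following is a toy illustration only: the attribute vocabulary, the prompt template, and the stub `generate_images` and `classifier_accuracy` functions are all hypothetical stand-ins (a real benchmark would call an actual text-to-image model and evaluate a real classifier, and would use a more principled attribute-selection strategy than this greedy random walk).

```python
import random

random.seed(0)

# Hypothetical attribute vocabulary; a real benchmark would use domain-relevant attributes.
ATTRIBUTES = {
    "weather": ["sunny", "rainy", "foggy"],
    "time": ["day", "night"],
    "angle": ["front", "side"],
}

def generate_images(prompt, n=4):
    """Stand-in for a text-to-image model: returns placeholder 'images'."""
    return [f"{prompt}#{i}" for i in range(n)]

def classifier_accuracy(images, prompt):
    """Simulated classifier evaluation: rare-looking conditions score lower.
    (A real evaluation would run the classifier on the generated images.)"""
    penalty = 0.3 * prompt.count("foggy") + 0.2 * prompt.count("night")
    return max(0.0, 0.9 - penalty + random.uniform(-0.05, 0.05))

def neighbors(combo):
    """Attribute combinations differing from `combo` in exactly one value."""
    out = []
    for key, values in ATTRIBUTES.items():
        for v in values:
            if v != combo[key]:
                out.append({**combo, key: v})
    return out

def benchmark(budget=8):
    """Alternate generation, evaluation, and attribute selection within a budget."""
    current = {k: v[0] for k, v in ATTRIBUTES.items()}
    scores = {}
    for _ in range(budget):
        key = tuple(current.values())
        if key not in scores:
            prompt = "a photo of a car, " + ", ".join(current.values())
            images = generate_images(prompt)          # image generation
            scores[key] = classifier_accuracy(images, prompt)  # classifier evaluation
        # attribute selection: move to an unexplored neighboring combination
        candidates = [c for c in neighbors(current)
                      if tuple(c.values()) not in scores] or [current]
        current = random.choice(candidates)
    worst_key = min(scores, key=scores.get)
    return dict(zip(ATTRIBUTES, worst_key)), scores[worst_key]
```

Under a generation budget, the loop spends evaluations near low-scoring attribute combinations instead of exhaustively enumerating every combination, which is the efficiency argument the method rests on.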