UnsafeBench: Benchmarking Image Safety Classifiers on Real-World and AI-Generated Images

With the advent of text-to-image models and growing concerns about their misuse, developers increasingly rely on image safety classifiers to moderate the unsafe images these models generate. Yet the performance of current image safety classifiers remains unknown for both real-world and AI-generated images. In this work, we propose UnsafeBench, a benchmarking framework that evaluates the effectiveness and robustness of image safety classifiers, with a particular focus on how AI-generated images affect their performance. First, we curate a large dataset of 10K real-world and AI-generated images, each annotated as safe or unsafe according to 11 categories of unsafe images (sexual, violent, hateful, etc.). Then, we evaluate the effectiveness and robustness of five popular image safety classifiers, as well as three classifiers powered by general-purpose visual language models. Our assessment indicates that existing image safety classifiers are neither comprehensive nor effective enough to mitigate the multifaceted problem of unsafe images. Moreover, a distribution shift exists between real-world and AI-generated images in image quality, style, and layout, which degrades the classifiers' effectiveness and robustness. Motivated by these findings, we build a comprehensive image moderation tool called PerspectiveVision, which addresses the main drawbacks of existing classifiers and improves effectiveness and robustness, especially on AI-generated images. UnsafeBench and PerspectiveVision can help the research community better understand the landscape of image safety classification in the era of generative AI.
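To make the real-world versus AI-generated comparison concrete, the following is a minimal sketch of how a classifier's effectiveness could be scored separately on each image source. It is not the UnsafeBench implementation: the function names `f1_score` and `f1_by_source`, the record layout, and the toy records are assumptions made for illustration. F1 is used here only as one common choice of metric, since the abstract does not name one.

```python
from collections import defaultdict


def f1_score(labels, preds):
    """F1 for the positive ("unsafe") class; True means unsafe."""
    tp = sum(1 for y, p in zip(labels, preds) if y and p)
    fp = sum(1 for y, p in zip(labels, preds) if not y and p)
    fn = sum(1 for y, p in zip(labels, preds) if y and not p)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def f1_by_source(records):
    """Group (source, label, prediction) records by source and score each group."""
    grouped = defaultdict(lambda: ([], []))
    for source, label, pred in records:
        grouped[source][0].append(label)
        grouped[source][1].append(pred)
    return {s: f1_score(ls, ps) for s, (ls, ps) in grouped.items()}


# Hypothetical toy records, not data from the paper:
# (image source, annotated unsafe?, classifier predicted unsafe?)
records = [
    ("real", True, True), ("real", True, True),
    ("real", False, False), ("real", False, True),
    ("ai", True, True), ("ai", True, False),
    ("ai", False, False), ("ai", False, False),
]

scores = f1_by_source(records)
for source, score in sorted(scores.items()):
    print(f"{source}: F1 = {score:.3f}")
```

A gap between the two per-source scores is the kind of signal that would point to a distribution shift between real-world and AI-generated images.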
