Breaking Free: How to Hack Safety Guardrails in Black-Box Diffusion Models!

Deep neural networks can be exploited using natural adversarial samples, which leave human perception unaffected. Current approaches often rely on white-box access to the deep neural network to generate these adversarial samples, or they synthetically shift the distribution of adversarial samples away from the training distribution. In contrast, we propose EvoSeed, a novel evolutionary strategy-based algorithmic framework for generating photo-realistic natural adversarial samples. Our EvoSeed framework uses auxiliary Conditional Diffusion and Classifier models to operate in a black-box setting. We employ CMA-ES to optimize the search for an initial seed vector which, when processed by the Conditional Diffusion Model, yields a natural adversarial sample that the Classifier Model misclassifies. Experiments show that the generated adversarial images are of high image quality, raising concerns about generating harmful content that bypasses safety classifiers. Our research opens new avenues for understanding the limitations of current safety mechanisms and the risk of plausible attacks against classifier systems using image generation. The project website can be accessed at: https://shashankkotyan.github.io/EvoSeed.
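
The seed-search loop described in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: `generate` is a tiny fixed function standing in for the Conditional Diffusion Model, `classify` is a linear softmax standing in for the black-box Classifier Model, the bound `eps` on the seed perturbation is a hypothetical constraint, and a plain (mu, lambda) evolution strategy with a fixed step size replaces CMA-ES (there is no covariance or step-size adaptation). The loop only queries the classifier's output probabilities, which is what makes the setting black-box.

```python
import numpy as np

def generate(seed, weights):
    """Toy stand-in for a conditional generator: maps a seed vector to an 'image'."""
    return np.tanh(weights @ seed)

def classify(image, clf_weights):
    """Toy stand-in for a black-box classifier: returns class probabilities."""
    logits = clf_weights @ image
    exp = np.exp(logits - logits.max())  # shift logits for numerical stability
    return exp / exp.sum()

def search_seed(seed, label, weights, clf_weights, eps=0.3,
                population=20, parents=5, generations=50, sigma=0.1, rng=None):
    """Simple (mu, lambda) evolution strategy, NOT CMA-ES.

    Searches for a perturbation delta with max|delta| <= eps (assumed bound)
    that lowers the probability the classifier assigns to `label`.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    mean = np.zeros_like(seed)
    best_delta = mean.copy()
    best_score = classify(generate(seed, weights), clf_weights)[label]
    for _ in range(generations):
        # Sample candidate perturbations around the current mean, kept in bounds.
        noise = rng.standard_normal((population, seed.size))
        candidates = np.clip(mean + sigma * noise, -eps, eps)
        # Fitness: probability of the original label (lower is better).
        scores = np.array([
            classify(generate(seed + delta, weights), clf_weights)[label]
            for delta in candidates
        ])
        order = np.argsort(scores)
        # Recombine the best candidates into the next mean.
        mean = candidates[order[:parents]].mean(axis=0)
        if scores[order[0]] < best_score:
            best_score = scores[order[0]]
            best_delta = candidates[order[0]].copy()
    return seed + best_delta, best_score
```

Calling `search_seed(z, label, W, C)` with random matrices `W` and `C` returns a seed within `eps` of `z` together with the lowest original-label probability found. Whether the predicted label actually flips depends on the toy weights and the bound, so no particular outcome is implied.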

