We propose Instruct2Attack (I2A), a language-guided semantic attack that generates semantically meaningful perturbations according to free-form language instructions. We use state-of-the-art latent diffusion models and adversarially guide the reverse diffusion process to search for an adversarial latent code conditioned on the input image and text instruction. Compared to existing noise-based and semantic attacks, I2A generates more natural and diverse adversarial examples while providing better controllability and interpretability. We further automate the attack by using GPT-4 to generate diverse image-specific text instructions. We show that I2A can successfully break state-of-the-art deep neural networks even under strong adversarial defenses, and that its adversarial examples transfer well across a variety of network architectures.
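To make the idea of searching for an adversarial latent code concrete, the following is a minimal toy sketch and not the I2A implementation. It assumes a fixed linear map as a stand-in for the generative decoder, a logistic classifier as a stand-in for the victim network, and an L-infinity budget `eps` around the starting latent; all function names, weights, and parameters here are hypothetical and chosen only for illustration. The image and text conditioning of a real latent diffusion model is omitted entirely.

```python
import math


def decode(z, W):
    """Toy linear 'decoder' (assumption): maps latent z to an 'image' x = W z."""
    return [sum(w_ij * z_j for w_ij, z_j in zip(row, z)) for row in W]


def predict(x, w, b):
    """Toy logistic classifier (assumption): probability that x is class 1."""
    s = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def adversarial_latent_search(z0, W, w, b, y, eps=0.5, step=0.1, iters=20):
    """Projected sign-gradient ascent on the cross-entropy loss w.r.t. the latent."""
    # W^T w: how each latent coordinate moves the classifier logit.
    g = [sum(W[i][j] * w[i] for i in range(len(W))) for j in range(len(z0))]
    z = list(z0)
    for _ in range(iters):
        p = predict(decode(z, W), w, b)
        # Gradient of the cross-entropy loss for true label y w.r.t. z.
        grad = [(p - y) * g_j for g_j in g]
        # Ascent step: increase the loss of the true label.
        z = [z_j + step * sign(gr_j) for z_j, gr_j in zip(z, grad)]
        # Project back into the L-infinity ball of radius eps around z0.
        z = [min(max(z_j, z0_j - eps), z0_j + eps) for z_j, z0_j in zip(z, z0)]
    return z


# Hypothetical toy weights: 2-dimensional latent, 3-dimensional 'image'.
W = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
w = [1.0, -0.5, 0.5]
b = 0.0
z0 = [0.4, 0.1]

z_adv = adversarial_latent_search(z0, W, w, b, y=1)
p_clean = predict(decode(z0, W), w, b)
p_adv = predict(decode(z_adv, W), w, b)
print("clean P(class 1):", p_clean)
print("adversarial P(class 1):", p_adv)
```

In this toy setting the search stays within the budget around `z0` while pushing the classifier's confidence in the true class below 0.5. A real attack of the kind described above would replace the linear decoder with the diffusion model's reverse process and would need a perceptual rather than a purely latent-space constraint; those parts are outside the scope of this sketch.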