Data poisoning attacks manipulate training data to introduce unexpected behaviors into machine learning models at training time. For text-to-image generative models with massive training datasets, current understanding of poisoning attacks suggests that a successful attack would require injecting millions of poison samples into the training pipeline. In this paper, we show that poisoning attacks can be successful on generative models. We observe that training data per concept can be quite limited in these models, making them vulnerable to prompt-specific poisoning attacks, which target a model's ability to respond to individual prompts. We introduce Nightshade, an optimized prompt-specific poisoning attack where poison samples look visually identical to benign images with matching text prompts. Nightshade poison samples are also optimized for potency and can corrupt a Stable Diffusion SDXL prompt with fewer than 100 poison samples. Nightshade poison effects "bleed through" to related concepts, and multiple attacks can be composed together in a single prompt. Surprisingly, we show that a moderate number of Nightshade attacks can destabilize general features in a text-to-image generative model, effectively disabling its ability to generate meaningful images. Finally, we propose the use of Nightshade and similar tools as a last defense for content creators against web scrapers that ignore opt-out/do-not-crawl directives, and discuss possible implications for model trainers and content creators.