Despite advances in Text-to-Image (T2I) generation models, their potential for misuse or abuse raises serious safety concerns. Model developers have invested considerable effort in safety mechanisms that address these concerns. However, existing safety mechanisms, whether external or internal, either remain susceptible to evasion under distribution shifts or require extensive model-specific adjustments. To address these limitations, we introduce Safe-Control, a plug-and-play safety patch designed to mitigate unsafe content generation in T2I models. Using data-driven strategies and safety-aware conditions, Safe-Control injects safety control signals into the locked T2I model, acting as an update in a patch-like manner. Model developers can also construct multiple safety patches to meet evolving safety requirements and flexibly merge them into a single, unified patch. Its plug-and-play design further ensures adaptability, making it compatible with other T2I models that share a similar denoising architecture. We conduct extensive evaluations on six diverse, publicly available T2I models with similar generative architectures. Empirical results show that Safe-Control effectively reduces unsafe content generation across all six models while maintaining the quality and text alignment of benign images. Compared with seven state-of-the-art safety mechanisms, including both external and internal defenses, Safe-Control significantly outperforms all baselines in reducing unsafe content generation. For example, it reduces the probability of unsafe content generation to 7%, compared to approximately 20% for most baseline methods, under both unsafe prompts and the latest adversarial attacks.