Text-to-image (T2I) generative models achieve remarkable visual fidelity, yet remain vulnerable to generating unsafe content. Existing safety defenses typically intervene inside the generative model itself, but suffer from severe concept entanglement that degrades benign generation quality, a trade-off we term the Safety Tax. To overcome this limitation, we advocate a paradigm shift from destructive internal editing to external safety rectification. Following this principle, we propose SafePatch, a structurally isolated safety module that performs external, interpretable rectification without modifying the base model. The backbone of SafePatch is instantiated as a trainable clone of the base model's encoder, allowing it to inherit rich semantic priors and maintain representation consistency. To enable interpretable safety rectification, we construct a strictly Aligned Counterfactual Safety (ACS) dataset for differential supervision. Across nudity and multi-category benchmarks, as well as recent adversarial prompt attacks, SafePatch achieves robust suppression of unsafe content (7% unsafe on I2P) while preserving image quality and semantic alignment.
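The scheme described above can be sketched in a minimal, illustrative form: a frozen base encoder, a trainable clone serving as the SafePatch backbone, and differential supervision that pulls the rectified embedding of an unsafe prompt toward the base encoding of its aligned safe counterfactual. Everything below (the linear toy encoders, dimensions, learning rate, and variable names) is an assumption for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension (illustrative)

# Frozen base text encoder: a toy linear map standing in for the real one.
W_base = rng.normal(size=(D, D))
W_base_snapshot = W_base.copy()  # to verify the base model is never edited

def base_encode(x):
    return W_base @ x

# SafePatch backbone: initialized as a clone of the base encoder, so it
# inherits the base model's "semantic priors" (here, the same weights).
W_patch = W_base.copy()

def rectify(x):
    # External rectification: the correction is produced by a separate
    # module; the base encoder's weights stay untouched.
    return W_patch @ x

# Differential supervision on one aligned counterfactual pair:
# features of an unsafe prompt `u` and its safe counterpart `s`.
u = rng.normal(size=D)
s = rng.normal(size=D)

lr = 0.01
for _ in range(200):
    # Pull the rectified unsafe embedding toward the base encoding of the
    # safe counterfactual; gradient of ||W_patch u - base_encode(s)||^2.
    err = rectify(u) - base_encode(s)
    W_patch -= lr * 2.0 * np.outer(err, u)

residual = float(np.linalg.norm(rectify(u) - base_encode(s)))
```

After training, the rectified unsafe embedding lands near the safe counterfactual's base encoding while `W_base` is bit-for-bit unchanged, which is the structural-isolation property the abstract claims.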