Image generation models (IGMs), while capable of producing impressive and creative content, often memorize a wide range of undesirable concepts from their training data, leading to the reproduction of unsafe content such as NSFW imagery and copyrighted artistic styles. Such behaviors pose persistent safety and compliance risks in real-world deployments and cannot be reliably mitigated by post-hoc filtering, owing to the limited robustness of such mechanisms and their lack of fine-grained semantic control. Recent unlearning methods seek to erase harmful concepts at the model level, but they suffer from notable limitations: they require costly retraining, degrade the quality of benign generations, or fail to withstand prompt paraphrasing and adversarial attacks. To address these challenges, we introduce SafeRedir, a lightweight inference-time framework for robust unlearning via prompt embedding redirection. Without modifying the underlying IGMs, SafeRedir adaptively routes unsafe prompts toward safe semantic regions through token-level interventions in the embedding space. The framework comprises two core components: a latent-aware multi-modal safety classifier that identifies unsafe generation trajectories, and a token-level delta generator for precise semantic redirection, equipped with auxiliary predictors for token masking and adaptive scaling that localize and regulate the intervention. Empirical results across multiple representative unlearning tasks demonstrate that SafeRedir achieves effective unlearning, high semantic and perceptual preservation, robust image quality, and enhanced resistance to adversarial attacks. Furthermore, SafeRedir generalizes effectively across a variety of diffusion backbones and existing unlearned models, validating its plug-and-play compatibility and broad applicability. Code and data are available at https://github.com/ryliu68/SafeRedir.
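The redirection mechanism described above can be illustrated with a minimal sketch. This is not the authors' implementation: the delta generator, token mask, and adaptive scale are stand-ins (here random arrays and fixed values), and all shapes and function names are hypothetical. The sketch only shows the algebra of a token-level intervention, where embeddings of tokens flagged as unsafe are shifted by a scaled delta while all other tokens pass through unchanged.

```python
import numpy as np

def redirect_embeddings(emb, delta, token_mask, scale):
    """Token-level redirection sketch (hypothetical interface).

    emb:        (T, D) prompt token embeddings
    delta:      (T, D) per-token shifts proposed by a delta generator
    token_mask: (T,)   in {0, 1}, selecting tokens to intervene on
    scale:      (T,)   adaptive per-token intervention strength
    """
    # Broadcast the masked, scaled shift over the embedding dimension;
    # unmasked tokens receive a zero shift and are preserved exactly.
    return emb + (scale * token_mask)[:, None] * delta

rng = np.random.default_rng(0)
T, D = 8, 16
emb = rng.normal(size=(T, D))
delta = rng.normal(size=(T, D))      # stand-in for the delta generator's output
token_mask = np.zeros(T)
token_mask[2] = 1.0                  # only token 2 flagged as unsafe (illustrative)
scale = np.full(T, 0.5)              # stand-in for the adaptive scaling predictor

redirected = redirect_embeddings(emb, delta, token_mask, scale)
assert np.allclose(redirected[0], emb[0])                      # untouched token preserved
assert np.allclose(redirected[2], emb[2] + 0.5 * delta[2])     # flagged token shifted
```

In the actual framework the mask and scale come from learned auxiliary predictors conditioned on the prompt, which is what localizes the edit to the unsafe tokens and regulates how far they are moved; this sketch fixes them by hand purely to show the intervention's shape.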