Controlling the behavior of text-to-image generative models is critical for safe and practical deployment. Existing safety approaches typically rely on model fine-tuning or curated datasets, which can degrade generation quality or limit scalability. We propose an inference-time steering framework that uses gradient feedback from frozen pretrained foundation models to guide generation without modifying the underlying generator. Our key observation is that vision-language foundation models encode rich semantic representations that can be repurposed as off-the-shelf supervisory signals during generation. By injecting this feedback through clean latent estimates at each sampling step, our method formulates safety steering as an energy-based sampling problem. This design enables modular, training-free safety control that is compatible with both diffusion and flow-matching models and generalizes across diverse visual concepts. Experiments demonstrate state-of-the-art robustness on NSFW red-teaming benchmarks and effective multi-target steering, while preserving high generation quality on benign, non-targeted prompts. Our framework provides a principled approach to using foundation models as semantic energy estimators, enabling reliable and scalable safety control for text-to-image generation.
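To make the sampling loop concrete, the following is a minimal toy sketch of energy-based steering through clean estimates, not the authors' implementation. Everything in it is an illustrative assumption: `toy_velocity` stands in for a frozen flow-matching generator, a 2-D Gaussian bump (`energy`) stands in for a foundation-model score of an unwanted concept, the analytic `energy_grad` replaces automatic differentiation through a real vision-language model, and `scale` is a hypothetical guidance strength. At each step the sketch forms the clean estimate, moves it down the energy gradient, and re-derives the velocity from the steered estimate.

```python
import math
import random


def toy_velocity(x, t, mu):
    """Hypothetical frozen generator: flow-matching velocity toward a point mass at mu."""
    return [(m - xi) / (1.0 - t) for xi, m in zip(x, mu)]


def clean_estimate(x, v, t):
    """One-step estimate of the final sample: x1_hat = x_t + (1 - t) * v."""
    return [xi + (1.0 - t) * vi for xi, vi in zip(x, v)]


def energy(x, center, sigma):
    """Toy concept energy: a Gaussian bump that is high near `center`."""
    sq = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
    return math.exp(-sq / (2.0 * sigma ** 2))


def energy_grad(x, center, sigma):
    """Analytic gradient of `energy` (a real system would use autograd instead)."""
    e = energy(x, center, sigma)
    return [-e * (xi - ci) / sigma ** 2 for xi, ci in zip(x, center)]


def sample(mu, center, sigma=1.0, scale=0.0, steps=50, seed=0):
    """Euler sampler; scale=0 recovers the unsteered generator."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in mu]  # start from Gaussian noise
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt  # t stays below 1, so (1 - t) is never zero
        v = toy_velocity(x, t, mu)
        x1_hat = clean_estimate(x, v, t)
        grad = energy_grad(x1_hat, center, sigma)
        # Steer the clean estimate down the energy, away from the concept.
        x1_steered = [h - scale * g for h, g in zip(x1_hat, grad)]
        # Re-derive the velocity that points at the steered clean estimate.
        v_guided = [(s - xi) / (1.0 - t) for s, xi in zip(x1_steered, x)]
        x = [xi + dt * vi for xi, vi in zip(x, v_guided)]
    return x


if __name__ == "__main__":
    mu, center = [1.0, 0.0], [1.5, 0.0]
    for scale in (0.0, 1.0):
        out = sample(mu, center, scale=scale)
        print(scale, out, energy(out, center, 1.0))
```

In this toy setting the unsteered run ends at the generator's target, while the steered run ends at a point with lower concept energy. The generator's parameters are never touched, which is the sense in which such steering is training-free.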