Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, which isolates the question of whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also introduce a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach on text-to-image and text-to-video generation across diverse source-target model pairs. Across these pairs, transferred safety directions achieve reductions in attack success rate (ASR) and CLIP-Score/FID trade-offs comparable to those of directions learned natively on the target model using unsafe data. This indicates that transferring a direction, rather than learning it natively, does not come at the expense of safety or generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms.
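The three stages described above (estimate, transport, apply) can be illustrated with a minimal sketch on synthetic activations. The abstract does not specify the estimator or the form of the alignment, so everything here is an assumption made for illustration: the direction is taken as a difference of means over paired activations, the alignment is a plain least-squares linear map, steering is an additive shift with a hypothetical strength `alpha`, and all function names and dimensions are invented.

```python
import numpy as np

def safety_direction(safe_h, unsafe_h):
    """Assumed estimator: unit-norm difference of mean activations (unsafe -> safe)."""
    d = safe_h.mean(axis=0) - unsafe_h.mean(axis=0)
    return d / np.linalg.norm(d)

def fit_alignment(src_benign, tgt_benign):
    """Assumed alignment: least-squares W such that src_benign @ W ~= tgt_benign."""
    W, *_ = np.linalg.lstsq(src_benign, tgt_benign, rcond=None)
    return W

def transport(direction, W):
    """Map a source-space direction into the target space and renormalise."""
    t = direction @ W
    return t / np.linalg.norm(t)

def steer(h, direction, alpha):
    """Assumed intervention: shift a target activation along the direction."""
    return h + alpha * direction

# Synthetic stand-ins for model activations (hypothetical sizes).
rng = np.random.default_rng(0)
d_src, d_tgt, n = 16, 12, 200
M = rng.normal(size=(d_src, d_tgt))        # unknown "true" source-to-target map
u = rng.normal(size=d_src)                 # hidden source-side safety offset

base = rng.normal(size=(n, d_src))         # paired prompts share a base activation
safe_h, unsafe_h = base + u, base - u      # source side only: safe vs. unsafe

src_benign = rng.normal(size=(n, d_src))   # benign data seen by both models
tgt_benign = src_benign @ M                # target side never sees unsafe data

W = fit_alignment(src_benign, tgt_benign)
transported = transport(safety_direction(safe_h, unsafe_h), W)

# Compare against the direction one would obtain natively in the target space.
native = (u @ M) / np.linalg.norm(u @ M)
cos = float(transported @ native)
```

In this noise-free toy setting the transported direction coincides with the native one, so `cos` is close to 1. Real activations are noisy and the relation between models need not be linear, so this shows only the data flow, not the method's actual behavior.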
