Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation. Existing defenses either require expensive safety fine-tuning that degrades general capability or rely on external filters that adversarial prompts bypass trivially. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers: a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to the hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we show that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.
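The two steps named in the abstract, finding a single safety direction and adding it to hidden states, can be illustrated with a minimal NumPy sketch. Everything here is an assumption made for illustration: the abstract does not specify the Supervised PCA variant, so the sketch uses one common formulation (the top eigenvector of X_c^T L X_c with a linear label kernel L = y_c y_c^T), and the names `safety_direction`, `steer`, and `alpha`, the pooling of hidden states into one feature vector per sample, and the synthetic data are all hypothetical.

```python
import numpy as np


def safety_direction(acts, labels):
    """Estimate one unit direction separating safe from unsafe activations.

    acts:   (n, d) array, one pooled hidden-state vector per sample (assumed).
    labels: (n,) array with 1 = safe and 0 = unsafe.
    """
    acts = np.asarray(acts, dtype=float)
    labels = np.asarray(labels, dtype=float)
    X = acts - acts.mean(axis=0, keepdims=True)  # center the features
    y = labels - labels.mean()                   # center the labels
    # With a linear label kernel L = y y^T, the matrix X^T L X equals
    # the outer product (X^T y)(X^T y)^T.
    q = X.T @ y
    Q = np.outer(q, q)
    # eigh returns eigenvalues in ascending order; take the last eigenvector.
    _, eigvecs = np.linalg.eigh(Q)
    v = eigvecs[:, -1]
    # Eigenvectors have arbitrary sign; orient this one toward the safe class.
    if v @ q < 0:
        v = -v
    return v / np.linalg.norm(v)


def steer(hidden, direction, alpha):
    """Add alpha * direction to every hidden-state vector (last axis = d)."""
    return hidden + alpha * direction


# Synthetic demo: two Gaussian clusters separated along a known axis.
rng = np.random.default_rng(0)
d = 16
true_axis = np.zeros(d)
true_axis[0] = 1.0
safe_acts = rng.normal(size=(200, d)) + 2.0 * true_axis
unsafe_acts = rng.normal(size=(200, d)) - 2.0 * true_axis
acts = np.vstack([safe_acts, unsafe_acts])
labels = np.concatenate([np.ones(200), np.zeros(200)])

v = safety_direction(acts, labels)
steered = steer(unsafe_acts, v, alpha=4.0)
print("cosine with true axis:", float(v @ true_axis))
print("mean projection before:", float((unsafe_acts @ v).mean()))
print("mean projection after: ", float((steered @ v).mean()))
```

With binary labels this formulation yields a rank-one matrix, so the recovered direction is proportional to the difference between the safe and unsafe class means. In a real video diffusion transformer the `steer` step would be applied to the output of one intermediate block during sampling; how that is wired in, and how `alpha` is chosen, is not stated in the abstract.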