Latent Video Diffusion Models (LVDMs) have achieved state-of-the-art generative quality for image and video generation; however, they remain brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion (e.g., Gaussian, Uniform) fail in video settings because static noise disrupts temporal fidelity. In this paper, we propose CAT-LVDM, a corruption-aware training framework with structured, data-aligned noise injection tailored for video diffusion. Our two operators, Batch-Centered Noise Injection (BCNI) and Spectrum-Aware Contextual Noise (SACN), align perturbations with batch semantics or spectral dynamics to preserve coherence. CAT-LVDM yields substantial gains: BCNI reduces FVD by 31.9 percent on WebVid-2M, MSR-VTT, and MSVD, while SACN improves UCF-101 by 12.3 percent, outperforming Gaussian, Uniform, and large diffusion baselines such as DEMO (2.3B) and LaVie (3B) despite training on 5x less data. Ablations confirm the unique value of low-rank, data-aligned noise, and theoretical analysis establishes why these operators tighten robustness and generalization bounds. CAT-LVDM thus introduces a principled framework for robust video diffusion and further demonstrates transferability to autoregressive generation and multimodal video understanding models.
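The abstract describes BCNI and SACN only at a high level (perturbations aligned with batch semantics or with spectral dynamics). Below is a minimal, hypothetical NumPy sketch of what such operators *could* look like; the function names, the batch-mean direction for BCNI, and the low-frequency weighting for SACN are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def bcni(emb, sigma=0.1, rng=None):
    """Hypothetical Batch-Centered Noise Injection: perturb each
    embedding along its offset from the batch mean, so the noise
    stays aligned with batch semantics (low-rank, data-aligned)."""
    rng = np.random.default_rng(rng)
    center = emb.mean(axis=0, keepdims=True)        # batch semantic center
    direction = emb - center                        # data-aligned directions
    scale = sigma * rng.standard_normal((emb.shape[0], 1))
    return emb + scale * direction

def sacn(emb_seq, sigma=0.1, rng=None):
    """Hypothetical Spectrum-Aware Contextual Noise: sample noise in
    the temporal frequency domain, down-weighting high frequencies so
    the perturbation does not disrupt temporal coherence."""
    rng = np.random.default_rng(rng)
    T, D = emb_seq.shape
    freqs = np.fft.rfftfreq(T)                      # 0 .. 0.5 cycles/frame
    weights = 1.0 / (1.0 + freqs * T)               # emphasize low frequencies
    noise_f = rng.standard_normal((freqs.size, D)) * weights[:, None]
    noise = np.fft.irfft(noise_f, n=T, axis=0)      # back to the time axis
    return emb_seq + sigma * noise
```

In this sketch, a batch whose embeddings are all identical has zero batch-centered directions, so `bcni` leaves it unchanged; `sacn` always preserves the input shape while concentrating perturbation energy in slow temporal modes.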