Latent Video Diffusion Models (LVDMs) have achieved state-of-the-art quality in image and video generation; however, they remain brittle under noisy conditioning, where small perturbations in text or multimodal embeddings can cascade over timesteps and cause semantic drift. Existing corruption strategies from image diffusion (Gaussian, Uniform) fail in video settings because static noise disrupts temporal fidelity. In this paper, we propose CAT-LVDM, a corruption-aware training framework with structured, data-aligned noise injection tailored for video diffusion. Our two operators, Batch-Centered Noise Injection (BCNI) and Spectrum-Aware Contextual Noise (SACN), align perturbations with batch semantics or spectral dynamics, respectively, to preserve coherence. CAT-LVDM yields substantial gains: BCNI reduces FVD by 31.9 percent on WebVid-2M, MSR-VTT, and MSVD, while SACN improves UCF-101 by 12.3 percent, outperforming Gaussian and Uniform corruption and even large diffusion baselines such as DEMO (2.3B) and Lavie (3B) despite training on 5x less data. Ablations confirm the unique value of low-rank, data-aligned noise, and our theoretical analysis establishes why these operators tighten robustness and generalization bounds. CAT-LVDM thus establishes a new framework for robust video diffusion, and our experiments show that it can also be extended to autoregressive generation and multimodal video understanding LLMs. Code, models, and samples are available at https://github.com/chikap421/catlvdm