Diffusion models have revolutionized image generation, and their extension to video generation has shown promise. However, current video diffusion models~(VDMs) rely on a scalar timestep variable applied at the clip level, which limits their ability to model the complex temporal dependencies needed for tasks such as image-to-video generation. To address this limitation, we propose a frame-aware video diffusion model~(FVDM), which introduces a novel vectorized timestep variable~(VTV). Unlike conventional VDMs, our approach allows each frame to follow an independent noise schedule, enhancing the model's capacity to capture fine-grained temporal dependencies. FVDM's flexibility is demonstrated across multiple tasks, including standard video generation, image-to-video generation, video interpolation, and long video synthesis. Through a diverse set of VTV configurations, we achieve superior quality in generated videos, overcoming challenges such as catastrophic forgetting during fine-tuning and limited generalizability in zero-shot methods. Our empirical evaluations show that FVDM outperforms state-of-the-art methods in video generation quality, while also excelling in extended tasks. By addressing fundamental shortcomings in existing VDMs, FVDM sets a new paradigm in video synthesis, offering a robust framework with significant implications for generative modeling and multimedia applications.
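To make the core idea concrete, the sketch below illustrates how a vectorized timestep differs from a scalar one in the diffusion forward process: instead of one timestep shared by the whole clip, each frame receives its own timestep and is noised according to its own schedule position. This is a minimal sketch of the general concept, not the paper's implementation; the linear-beta DDPM schedule, array shapes, and function names are illustrative assumptions.

```python
import numpy as np

def add_noise_per_frame(video, alphas_cumprod, t_vec, rng):
    """Forward-diffuse a clip with a per-frame (vectorized) timestep.

    video          : (F, C, H, W) clean frames
    alphas_cumprod : (T,) cumulative product of (1 - beta_t)
    t_vec          : (F,) independent timestep for each frame -- the VTV idea;
                     a conventional VDM would instead use a single scalar t
    """
    a = alphas_cumprod[t_vec].reshape(-1, 1, 1, 1)  # broadcast per frame
    noise = rng.standard_normal(video.shape)
    # standard DDPM forward process, applied frame-wise
    return np.sqrt(a) * video + np.sqrt(1.0 - a) * noise

# toy linear-beta schedule (illustrative, not the paper's schedule)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alphas_cumprod = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
video = rng.standard_normal((8, 3, 16, 16))   # 8 frames
t_vec = rng.integers(0, T, size=8)            # one timestep per frame
noisy = add_noise_per_frame(video, alphas_cumprod, t_vec, rng)
```

Special VTV configurations recover familiar settings: a constant `t_vec` reduces to a standard clip-level VDM, while pinning one frame's timestep near zero (keeping it almost clean) corresponds to conditioning tasks such as image-to-video generation.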