Although diffusion models have shown a powerful ability to generate photorealistic images, generating realistic and diverse videos remains in its infancy. One key reason is that current methods entangle spatial content and temporal dynamics, which notably increases the complexity of text-to-video (T2V) generation. In this work, we propose HiGen, a diffusion-based method that improves performance by decoupling the spatial and temporal factors of videos from two perspectives: the structure level and the content level. At the structure level, we decompose the T2V task into two steps, spatial reasoning and temporal reasoning, using a unified denoiser. Specifically, we generate spatially coherent priors from text during spatial reasoning and then generate temporally coherent motions from these priors during temporal reasoning. At the content level, we extract two subtle cues from the content of the input video that express motion changes and appearance changes, respectively. These two cues then guide the model's training for video generation, enabling flexible content variations and enhancing temporal stability. Through this decoupled paradigm, HiGen effectively reduces the complexity of the task and generates realistic videos with semantic accuracy and motion stability. Extensive experiments demonstrate the superior performance of HiGen over state-of-the-art T2V methods.