With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has made substantial progress. However, existing video generation models are typically trained on a limited number of frames and therefore cannot generate high-fidelity long videos at inference time. Furthermore, these models support only a single text condition, whereas real-life scenarios often require multiple text conditions as the video content changes over time. To tackle these challenges, this study explores extending text-driven generation to longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building on this analysis, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noise for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them using a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurs about 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
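The two ideas named in the abstract, rescheduling a noise sequence for long-range correlation and applying a window-based function along the time axis, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `reschedule_noise` and `windowed_apply`, the choice to reuse a base set of noise frames in shuffled order, and the uniform averaging of overlapping windows are all assumptions made for illustration.

```python
import numpy as np


def reschedule_noise(base_frames, total_frames, shape, seed=0):
    """Build noise for `total_frames` frames from `base_frames` noise frames.

    Assumption (hypothetical scheme): the first `base_frames` frames get fresh
    Gaussian noise; each later block reuses those same frames in a shuffled
    order, so distant frames share noise without repeating the exact sequence.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((base_frames, *shape))
    order = list(range(base_frames))
    while len(order) < total_frames:
        order.extend(rng.permutation(base_frames).tolist())
    order = np.array(order[:total_frames])
    return base[order], order


def windowed_apply(frames, fn, window, stride):
    """Apply `fn` to sliding windows over the frame axis and average overlaps.

    `fn` stands in for a temporal operation (such as temporal attention) that
    only accepts `window` frames at a time. Uniform averaging is an assumption.
    """
    total = frames.shape[0]
    if total < window:
        raise ValueError("need at least `window` frames")
    starts = list(range(0, total - window + 1, stride))
    if starts[-1] + window < total:
        starts.append(total - window)  # make sure the tail is covered
    out = np.zeros(frames.shape, dtype=float)
    count = np.zeros(total)
    for s in starts:
        out[s:s + window] += fn(frames[s:s + window])
        count[s:s + window] += 1
    # Broadcast the per-frame counts over the remaining axes.
    return out / count.reshape(-1, *([1] * (frames.ndim - 1)))
```

For example, `reschedule_noise(16, 64, (4, 8, 8))` returns noise for 64 frames built from 16 independent noise frames, and `windowed_apply(noise, fn, window=16, stride=4)` runs `fn` on overlapping 16-frame windows so that a model limited to 16 frames can still process the full sequence.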