With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has made substantial progress. However, existing video generation models are typically trained on a limited number of frames and therefore cannot generate high-fidelity long videos at inference time. Furthermore, these models support only a single text condition, whereas real-life scenarios often require multiple text conditions as the video content changes over time. To tackle these challenges, this study explores extending text-driven generation to longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building on this analysis, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noise for every frame, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them using a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. It is noteworthy that, compared with the previous best-performing method, which incurred about 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
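The two ideas named above, noise rescheduling and window-based temporal processing, can be illustrated with a minimal NumPy sketch. It is not the FreeNoise implementation. The function names `reschedule_noise` and `windowed_apply`, the choice to reuse a shuffled copy of the first window's noise for later frames, and the uniform averaging of overlapping windows are all assumptions made for illustration; the abstract states only that noises are rescheduled for long-range correlation and that temporal attention is applied through a window-based function.

```python
import numpy as np


def reschedule_noise(num_frames, window, frame_shape, seed=0):
    """Build initial noise for a long video from one window of fresh noise.

    Assumption (hypothetical scheme): frames beyond the first window reuse
    a shuffled copy of the first window's noise, so distant frames share
    the same noise samples and stay correlated.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((window, *frame_shape))
    chunks = [base]
    total = window
    while total < num_frames:
        perm = rng.permutation(window)  # shuffle frame order within the window
        chunks.append(base[perm])
        total += window
    return np.concatenate(chunks, axis=0)[:num_frames]


def windowed_apply(x, fn, window, stride):
    """Apply `fn` to sliding windows along the frame axis and fuse the results.

    `fn` stands in for a temporal attention layer that only handles
    `window` frames at a time. Overlapping outputs are averaged uniformly,
    which is an assumption of this sketch.
    """
    num_frames = x.shape[0]
    out = np.zeros(x.shape, dtype=float)
    weight = np.zeros(num_frames)
    starts = list(range(0, max(num_frames - window, 0) + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # make sure the tail is covered
    for s in starts:
        out[s:s + window] += fn(x[s:s + window])
        weight[s:s + window] += 1.0
    # Broadcast the per-frame counts over the remaining axes.
    return out / weight.reshape((-1,) + (1,) * (x.ndim - 1))
```

For example, `reschedule_noise(40, 16, (4, 8, 8))` returns noise for 40 frames built from 16 independent frame noises, and `windowed_apply(noise, fn, window=16, stride=4)` runs a hypothetical 16-frame operator `fn` over the whole sequence.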