With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has made substantial progress. However, existing video generation models are typically trained on a limited number of frames and therefore cannot generate high-fidelity long videos at inference time. Furthermore, these models support only a single text condition, whereas real-life scenarios often require multiple text conditions as the video content changes over time. To tackle these challenges, this study explores extending text-driven generation to longer videos conditioned on multiple texts. 1) We first analyze the impact of initial noise in video diffusion models. Building on this analysis, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing noises for all frames, we reschedule a sequence of noises for long-range correlation and perform temporal attention over them using a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurs about 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
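The two ideas named in the abstract, rescheduling a sequence of noises and processing frames through a window-based function, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the rescheduling rule or the window function, so the sketch assumes that (a) noise for one window of frames is sampled once and reused in shuffled order for later frames, and (b) outputs of overlapping windows are averaged. The names `reschedule_noise`, `sliding_windows`, and `windowed_apply`, and all parameters, are hypothetical.

```python
import numpy as np


def reschedule_noise(num_frames, window, frame_shape, seed=0):
    """Build initial noise for a long video by reusing one window of noise.

    Assumption (not from the abstract): the first `window` frames receive
    independent Gaussian noise, and every later block reuses those same
    noise frames in a freshly shuffled order, so distant frames stay
    correlated with the first window.
    """
    rng = np.random.default_rng(seed)
    base = rng.standard_normal((window, *frame_shape))
    blocks = [base]
    while sum(len(b) for b in blocks) < num_frames:
        order = rng.permutation(window)  # shuffled reuse of the base noise
        blocks.append(base[order])
    return np.concatenate(blocks, axis=0)[:num_frames]


def sliding_windows(num_frames, window, stride):
    """Return start indices of overlapping windows that cover every frame."""
    if num_frames < window:
        raise ValueError("num_frames must be at least the window size")
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # final window flush with the end
    return starts


def windowed_apply(fn, frames, window, stride):
    """Apply `fn` to each window of frames and average overlapping outputs.

    `fn` stands in for a temporal-attention layer that only accepts
    `window` frames at a time (an assumption made for illustration).
    """
    frames = np.asarray(frames, dtype=float)
    out = np.zeros_like(frames)
    count = np.zeros(len(frames))
    for s in sliding_windows(len(frames), window, stride):
        out[s:s + window] += fn(frames[s:s + window])
        count[s:s + window] += 1
    # Broadcast the per-frame counts over the remaining dimensions.
    return out / count.reshape(-1, *([1] * (frames.ndim - 1)))


noise = reschedule_noise(num_frames=40, window=16, frame_shape=(4, 8, 8))
fused = windowed_apply(lambda w: w, noise, window=16, stride=4)
```

With an identity `fn`, `windowed_apply` returns its input unchanged. That makes it a quick sanity check before substituting a real per-window operation.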