FreeNoise: Tuning-Free Longer Video Diffusion via Noise Rescheduling

With the availability of large-scale video datasets and advances in diffusion models, text-driven video generation has made substantial progress. However, existing video generation models are typically trained on a limited number of frames, so they cannot generate high-fidelity long videos at inference time. Furthermore, these models support only a single text condition, whereas real-life scenarios often require multiple text conditions as the video content changes over time. To tackle these challenges, this study explores extending text-driven video generation to longer videos conditioned on multiple texts. 1) We first analyze the impact of the initial noise in video diffusion models. Building on these observations, we propose FreeNoise, a tuning-free and time-efficient paradigm that enhances the generative capabilities of pretrained video diffusion models while preserving content consistency. Specifically, instead of initializing independent noise for all frames, we reschedule a sequence of noises to obtain long-range correlation and perform temporal attention over them with a window-based function. 2) Additionally, we design a novel motion injection method to support the generation of videos conditioned on multiple text prompts. Extensive experiments validate the superiority of our paradigm in extending the generative capabilities of video diffusion models. Notably, whereas the previous best-performing method incurs about 255% extra time cost, our method incurs only a negligible extra time cost of approximately 17%. Generated video samples are available at our website: http://haonanqiu.com/projects/FreeNoise.html.
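The two ideas named in the abstract, rescheduling a sequence of noises and applying temporal attention in windows, can be illustrated with a minimal NumPy sketch. The abstract does not give the exact scheme, so everything here is an assumption made for illustration: the function names `reschedule_noise` and `sliding_windows`, the choice to extend the noise by shuffling whole copies of the first window, and the window and stride values. It is not the authors' implementation.

```python
import numpy as np


def reschedule_noise(num_frames, window, frame_shape, rng):
    """Build initial noise for `num_frames` frames from `window` base frames.

    Assumption: later frames reuse the base noise in shuffled order, so
    distant frames stay correlated instead of being sampled independently.
    """
    base = rng.standard_normal((window, *frame_shape))
    chunks = [base]
    total = window
    while total < num_frames:
        perm = rng.permutation(window)  # shuffled copy of the base frames
        chunks.append(base[perm])
        total += window
    return np.concatenate(chunks, axis=0)[:num_frames]


def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame ranges for window-based temporal attention.

    Assumes num_frames >= window. The last window is shifted so that it
    ends exactly at the final frame.
    """
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] != num_frames - window:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


rng = np.random.default_rng(0)
# Hypothetical sizes: 40 frames, a 16-frame window, 4x8x8 latents per frame.
noise = reschedule_noise(40, 16, (4, 8, 8), rng)
windows = sliding_windows(40, 16, 4)
```

In this sketch a pretrained model would only ever attend over 16 frames at a time, which matches the frame count it was trained on, while the shared base noise keeps content consistent across windows. How overlapping window outputs are combined is left out because the abstract does not describe it.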

