Despite tremendous progress in the field of text-to-video (T2V) synthesis, open-source T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the visual change over time implied by the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), in which we alter the generative process on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP): automatic generation of a video synopsis from the original single prompt using LLMs, which gives accurate textual guidance for the different visual states of longer videos, and 2) Temporal Attention Regularization (TAR): a regularization technique that refines the temporal attention units of pre-trained T2V diffusion models and thereby enables control over the video dynamics. We experimentally showcase the superiority of the proposed approach over existing open-source T2V models in generating longer, visually appealing videos. We additionally analyze the temporal attention maps obtained with and without VSTAR, demonstrating the importance of applying our method to mitigate neglect of the desired visual change over time.
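To make the idea of regularizing temporal attention at inference time more concrete, the following is a minimal sketch and not the VSTAR implementation. The abstract does not specify the form of the regularizer, so the sketch assumes one plausible choice: a symmetric, banded bias with a Gaussian falloff over frame distance, added to the temporal attention logits before the softmax. The function names (`toeplitz_gaussian_bias`, `regularized_temporal_attention`) and the parameters `sigma` and `scale` are hypothetical and chosen only for illustration.

```python
import numpy as np


def toeplitz_gaussian_bias(num_frames, sigma, scale=1.0):
    """Symmetric bias whose entries depend only on the frame distance |i - j|.

    Assumption for illustration: a Gaussian falloff with standard
    deviation `sigma`, peaking at `scale` on the diagonal.
    """
    idx = np.arange(num_frames)
    dist = np.abs(idx[:, None] - idx[None, :])
    return scale * np.exp(-(dist ** 2) / (2.0 * sigma ** 2))


def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def regularized_temporal_attention(q, k, v, sigma=2.0, scale=1.0):
    """Scaled dot-product attention over the frame axis with an added bias.

    q, k, v: arrays of shape (num_frames, dim) for one spatial location.
    Returns the attended values and the attention weights.
    """
    num_frames, dim = q.shape
    logits = q @ k.T / np.sqrt(dim)
    # Hypothetical regularizer: nudge each frame toward its temporal neighbors.
    logits = logits + toeplitz_gaussian_bias(num_frames, sigma, scale)
    weights = softmax(logits, axis=-1)
    return weights @ v, weights


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.normal(size=(8, 4)) for _ in range(3))
    out, weights = regularized_temporal_attention(q, k, v, sigma=2.0)
    print(out.shape, weights.shape)
```

Because the bias is applied only at inference, the pre-trained weights stay untouched, which matches the "on the fly" framing of GTN. Under this assumed form, a smaller `sigma` concentrates attention on nearby frames, while a larger `sigma` approaches a uniform bias that leaves the softmax output unchanged.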