VSTAR: Generative Temporal Nursing for Longer Dynamic Video Synthesis

Despite tremendous progress in text-to-video (T2V) synthesis, open-source T2V diffusion models struggle to generate longer videos with dynamically varying and evolving content. They tend to synthesize quasi-static videos, ignoring the visual change over time implied by the text prompt. At the same time, scaling these models to enable longer, more dynamic video synthesis often remains computationally intractable. To address this challenge, we introduce the concept of Generative Temporal Nursing (GTN), in which the generative process is altered on the fly during inference to improve control over the temporal dynamics and enable generation of longer videos. We propose a method for GTN, dubbed VSTAR, which consists of two key ingredients: 1) Video Synopsis Prompting (VSP), the automatic generation of a video synopsis from the original single prompt using LLMs, which gives accurate textual guidance for the different visual states of longer videos, and 2) Temporal Attention Regularization (TAR), a regularization technique that refines the temporal attention units of pre-trained T2V diffusion models and enables control over the video dynamics. Experiments show that the proposed approach generates longer, more visually appealing videos than existing open-source T2V models. We additionally analyze the temporal attention maps obtained with and without VSTAR, demonstrating that our method mitigates neglect of the desired visual change over time.
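The abstract does not say how TAR modifies the temporal attention units. As an illustration only, one plausible reading is to add a band-shaped bias to the pre-softmax temporal attention scores, so that each frame attends more strongly to its temporal neighbours than to distant frames. The sketch below assumes exactly that; the function names and the `sigma` and `scale` parameters are hypothetical and are not taken from the paper.

```python
import math


def gaussian_band_bias(num_frames, sigma, scale=1.0):
    """Hypothetical regularization matrix: entry (i, j) depends only on |i - j|.

    The result is a symmetric Toeplitz matrix that peaks on the diagonal and
    decays with the temporal distance between frames i and j.
    """
    return [
        [scale * math.exp(-((i - j) ** 2) / (2.0 * sigma ** 2))
         for j in range(num_frames)]
        for i in range(num_frames)
    ]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_temporal_attention(logits, sigma, scale=1.0):
    """Add the band bias to frame-by-frame attention scores, then normalize.

    `logits` is a square list of lists holding pre-softmax temporal attention
    scores for a single spatial location (an assumption of this sketch).
    """
    n = len(logits)
    bias = gaussian_band_bias(n, sigma, scale)
    return [
        softmax([logits[i][j] + bias[i][j] for j in range(n)])
        for i in range(n)
    ]


# Usage: uniform scores over 8 frames model a quasi-static attention pattern.
uniform_logits = [[0.0] * 8 for _ in range(8)]
plain = regularized_temporal_attention(uniform_logits, sigma=1.0, scale=0.0)
banded = regularized_temporal_attention(uniform_logits, sigma=1.0, scale=4.0)
```

With `scale=0.0` the attention stays uniform across frames, while a positive `scale` concentrates each frame's attention on nearby frames. In this sketch `sigma` controls how wide the band is, which is one way such a regularizer could offer control over video dynamics.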

