FlowZero: Zero-Shot Text-to-Video Synthesis with LLM-Driven Dynamic Scene Syntax

Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions within complex video text into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally-coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text, where LLMs can generate a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These elements in DSS are then used to guide the image diffusion model for video generation with smooth object motions and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process, enhancing the alignment between the spatio-temporal layouts and the textual prompts for the videos. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics to control the background movement and camera motion adaptively. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero achieves improvement in zero-shot video synthesis, generating coherent videos with vivid motion.

翻译：文本至视频（T2V）生成是一个快速发展的研究领域，旨在将复杂视频文本中的场景、物体和动作转化为连贯的视觉帧序列。我们提出FlowZero，一种结合大语言模型（LLMs）与图像扩散模型的新型框架，用于生成时间连贯的视频。FlowZero利用LLMs理解文本中的复杂时空动态，其中LLMs可生成包含场景描述、物体布局及背景运动模式的综合动态场景语法（DSS）。DSS中的这些元素随后用于引导图像扩散模型进行视频生成，实现平滑的物体运动及帧间连贯性。此外，FlowZero引入迭代自我优化过程，增强时空布局与视频文本提示之间的对齐。为提升整体连贯性，我们提出通过运动动态丰富每帧的初始噪声，自适应地控制背景移动与相机运动。通过利用时空语法引导扩散过程，FlowZero在零样本视频合成中实现性能提升，生成运动生动且连贯的视频。

相关内容

DSS

关注 479

决策支持系统（Decision Support Systems）期刊中发表的文章的共同主线是它们与支持增强决策制定的理论和技术问题的相关性。所涉及的领域可能包括基础、功能、接口、实现、影响和决策支持系统(DSS)的评估。手稿可以从不同的方法和方法学中获得，包括决策理论、经济学、计量经济学、统计学、计算机支持的协作工作、数据库管理、语言学、管理科学、数学建模、运营管理、认知科学、心理学、用户界面管理等。但是，一份侧重于对任何这些相关领域的直接贡献的手稿应提交给适合于特定领域的机构。官网地址：http://dblp.uni-trier.de/db/journals/dss/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【亚马逊-WWW2020】不解析,生成!用于面向任务的语义分析的序列到序列体系结构，Don't Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing

专知会员服务

15+阅读 · 2020年2月1日

FlowQA: Grasping Flow in History for Conversational Machine Comprehension

专知会员服务

35+阅读 · 2019年10月18日

Auto-Sizing the Transformer Network: Improving Speed, Efficiency, and Performance for Low-Resource Machine Translation

专知会员服务

50+阅读 · 2019年10月17日