Text-to-video (T2V) generation is a rapidly growing research area that aims to translate the scenes, objects, and actions described in complex text prompts into a sequence of coherent visual frames. We present FlowZero, a novel framework that combines Large Language Models (LLMs) with image diffusion models to generate temporally coherent videos. FlowZero uses LLMs to understand complex spatio-temporal dynamics from text: the LLM generates a comprehensive dynamic scene syntax (DSS) containing scene descriptions, object layouts, and background motion patterns. These DSS elements then guide the image diffusion model to generate videos with smooth object motion and frame-to-frame coherence. Moreover, FlowZero incorporates an iterative self-refinement process that improves the alignment between the spatio-temporal layouts and the textual prompts. To enhance global coherence, we propose enriching the initial noise of each frame with motion dynamics, which adaptively controls background movement and camera motion. By using spatio-temporal syntaxes to guide the diffusion process, FlowZero improves zero-shot video synthesis, generating coherent videos with vivid motion.
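To make the idea of a DSS and motion-enriched initial noise more concrete, the following is a minimal illustrative sketch, not FlowZero's actual implementation. The abstract does not specify the DSS schema or how motion is injected into the noise, so everything here is an assumption: the class names `FrameLayout` and `DynamicSceneSyntax`, the per-frame integer background shift, the circular shift of a shared noise grid, and the `shared` mixing parameter are all hypothetical.

```python
import math
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical DSS container: the field names are assumptions for illustration,
# chosen to mirror the three elements named in the abstract.
@dataclass
class FrameLayout:
    description: str                                    # per-frame scene description
    boxes: Dict[str, Tuple[float, float, float, float]]  # object -> (x0, y0, x1, y1), normalized

@dataclass
class DynamicSceneSyntax:
    frames: List[FrameLayout]
    background_shift: Tuple[int, int]  # assumed (dx, dy) background motion per frame, in latent cells

def shift_grid(grid: List[List[float]], dx: int, dy: int) -> List[List[float]]:
    """Circularly shift a 2D grid by dx columns and dy rows."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)] for y in range(h)]

def motion_enriched_noise(dss: DynamicSceneSyntax, height: int, width: int,
                          shared: float = 0.9, seed: int = 0) -> List[List[List[float]]]:
    """Return one noise grid per frame.

    Each frame reuses a shared Gaussian grid, shifted along the assumed
    background motion, and blends in fresh Gaussian noise. `shared` is the
    fraction of variance taken from the shifted grid, so each sample keeps
    unit variance: shared + (1 - shared) = 1.
    """
    rng = random.Random(seed)
    base = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
    dx, dy = dss.background_shift
    keep, fresh = math.sqrt(shared), math.sqrt(1.0 - shared)
    noises = []
    for t in range(len(dss.frames)):
        moved = shift_grid(base, dx * t, dy * t)  # background displaced by t steps
        noises.append([[keep * v + fresh * rng.gauss(0.0, 1.0) for v in row]
                       for row in moved])
    return noises

# Usage: a two-frame scene whose background drifts one latent cell to the right per frame.
dss = DynamicSceneSyntax(
    frames=[
        FrameLayout("a butterfly above a flower", {"butterfly": (0.1, 0.1, 0.3, 0.3)}),
        FrameLayout("the butterfly lands on the flower", {"butterfly": (0.4, 0.5, 0.6, 0.7)}),
    ],
    background_shift=(1, 0),
)
noise_per_frame = motion_enriched_noise(dss, height=4, width=6)
```

In this sketch, correlated noise across frames stands in for the global coherence the abstract describes, and the shift direction stands in for background or camera motion. A real pipeline would operate on multi-channel latent tensors and would feed the object boxes into the diffusion model's guidance, which is omitted here.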