We introduce Lumiere, a text-to-video diffusion model designed to synthesize videos that portray realistic, diverse, and coherent motion, a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, in a single pass through the model. This is in contrast to existing video models, which synthesize distant keyframes followed by temporal super-resolution, an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling, and by leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it at multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
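As a toy illustration of what joint spatial and temporal down- and up-sampling mean for a video tensor, the sketch below uses fixed average pooling and nearest-neighbour repetition in NumPy. This is an assumption made for clarity: the actual Space-Time U-Net uses learned layers, and the function names, tensor layout `(T, H, W, C)`, and sizes here are hypothetical rather than taken from the paper.

```python
import numpy as np


def downsample_space_time(video, t_factor=2, s_factor=2):
    """Average-pool a (T, H, W, C) array over time and space.

    Hypothetical helper: a fixed pooling stand-in for a learned
    space-time downsampling layer.
    """
    t, h, w, c = video.shape
    if t % t_factor or h % s_factor or w % s_factor:
        raise ValueError("T, H and W must be divisible by the pooling factors")
    # Split each of T, H, W into (coarse index, position inside the block).
    blocks = video.reshape(
        t // t_factor, t_factor,
        h // s_factor, s_factor,
        w // s_factor, s_factor,
        c,
    )
    # Average over the three within-block axes.
    return blocks.mean(axis=(1, 3, 5))


def upsample_space_time(video, t_factor=2, s_factor=2):
    """Nearest-neighbour upsampling of a (T, H, W, C) array in time and space."""
    out = np.repeat(video, t_factor, axis=0)  # repeat frames
    out = np.repeat(out, s_factor, axis=1)    # repeat rows
    out = np.repeat(out, s_factor, axis=2)    # repeat columns
    return out


# Illustrative sizes only: 16 frames of 32x32 RGB noise.
clip = np.random.default_rng(0).normal(size=(16, 32, 32, 3))
coarse = downsample_space_time(clip)      # shape (8, 16, 16, 3)
restored = upsample_space_time(coarse)    # shape (16, 32, 32, 3)
```

The point of the sketch is only that the coarse representation is shorter in time as well as smaller in space, while every frame of the clip still contributes to it. A keyframe-based pipeline, by contrast, would keep only a sparse subset of frames and fill in the rest later.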