Recent text-to-video generation approaches rely on computationally heavy training and require large-scale video datasets. In this paper, we introduce the new task of zero-shot text-to-video generation and propose a low-cost approach (without any training or optimization) that leverages existing text-to-image synthesis methods (e.g., Stable Diffusion) and adapts them to the video domain. Our key modifications include (i) enriching the latent codes of the generated frames with motion dynamics to keep the global scene and the background temporally consistent; and (ii) reprogramming frame-level self-attention using a new cross-frame attention of each frame on the first frame, to preserve the context, appearance, and identity of the foreground object. Experiments show that these modifications yield high-quality and remarkably consistent video generation at low overhead. Moreover, our approach is not limited to text-to-video synthesis but is also applicable to other tasks such as conditional and content-specialized video generation, and Video Instruct-Pix2Pix, i.e., instruction-guided video editing. As experiments show, our method performs comparably to, or sometimes better than, recent approaches, despite not being trained on additional video data. Our code will be open sourced at: https://github.com/Picsart-AI-Research/Text2Video-Zero .
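As a rough illustration of modification (ii), the following NumPy sketch contrasts ordinary per-frame self-attention with a cross-frame variant in which every frame's queries attend to the keys and values of the first frame. The function names, the `(frames, tokens, dim)` tensor layout, and the single-head formulation are assumptions made for illustration only; this is not the authors' implementation, and the multi-head projections used inside the Stable Diffusion U-Net are omitted.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(q, k, v):
    """Standard per-frame attention: frame i attends to its own keys/values.

    q, k, v: arrays of shape (frames, tokens, dim) (hypothetical layout).
    """
    scale = 1.0 / np.sqrt(q.shape[-1])
    scores = (q @ k.transpose(0, 2, 1)) * scale  # (frames, tokens, tokens)
    return softmax(scores, axis=-1) @ v


def cross_frame_attention(q, k, v):
    """Cross-frame attention sketch: every frame attends to frame 0.

    Queries stay per-frame, while keys and values are taken from the
    first frame only, so appearance information is shared across frames.
    """
    k_first = np.broadcast_to(k[:1], k.shape)  # reuse frame-0 keys
    v_first = np.broadcast_to(v[:1], v.shape)  # reuse frame-0 values
    return self_attention(q, k_first, v_first)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frames, tokens, dim = 4, 16, 8
    q = rng.normal(size=(frames, tokens, dim))
    k = rng.normal(size=(frames, tokens, dim))
    v = rng.normal(size=(frames, tokens, dim))
    out = cross_frame_attention(q, k, v)
    print(out.shape)
```

In this sketch the first frame's output is identical to plain self-attention, while all later frames draw their values from the first frame, which is one way to read the abstract's claim that the appearance and identity of the foreground object are preserved.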