We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach has two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
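To illustrate the idea behind window attention over a joint image and video latent space, the following is a minimal sketch of how a latent grid might be split into non-overlapping spatial and spatiotemporal windows, so that self-attention runs within each window rather than over all tokens. It is not the authors' implementation: the function name `partition_windows`, the latent shape, and the window sizes are all assumptions chosen for illustration, since the abstract does not specify them.

```python
import numpy as np

def partition_windows(latents, window):
    """Split a (T, H, W, C) latent grid into non-overlapping windows.

    Returns an array of shape (num_windows, tokens_per_window, C), so that
    self-attention can be applied independently inside each window.
    """
    T, H, W, C = latents.shape
    wt, wh, ww = window
    # Assumption: every dimension divides evenly by its window size.
    assert T % wt == 0 and H % wh == 0 and W % ww == 0
    x = latents.reshape(T // wt, wt, H // wh, wh, W // ww, ww, C)
    # Move the window-index axes to the front, within-window axes after them.
    x = x.transpose(0, 2, 4, 1, 3, 5, 6)
    return x.reshape(-1, wt * wh * ww, C)

# Hypothetical latent sizes (4 frames, 8x8 grid, 16 channels), illustration only.
latents = np.random.default_rng(0).normal(size=(4, 8, 8, 16))

# Spatial windows: each window covers one full latent frame.
spatial = partition_windows(latents, window=(1, 8, 8))

# Spatiotemporal windows: each window spans all 4 frames over a 4x4 patch.
spatiotemporal = partition_windows(latents, window=(4, 4, 4))
```

With these assumed sizes, each window holds 64 tokens, whereas full attention would operate on all 256 tokens at once. Because attention cost grows quadratically with sequence length, restricting it to windows is one way such an architecture could reduce memory and compute.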