This work introduces the Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. VDT is built from transformer blocks with modularized temporal and spatial attention modules, which allows each component to be optimized separately and leverages the rich spatial-temporal representation capacity of transformers. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies, producing temporally consistent video frames and even simulating the dynamics of 3D objects over time. 2) It supports flexible conditioning through simple concatenation in the token space, effectively unifying video generation and prediction tasks. 3) Its modularized design facilitates a spatial-temporal decoupled training strategy, leading to improved efficiency. Extensive experiments on video generation, prediction, and dynamics modeling (i.e., physics-based QA) demonstrate the effectiveness of VDT in various scenarios, including autonomous driving, human action, and physics-based simulation. We hope our study of the capabilities of transformer-based video diffusion in capturing accurate temporal dependencies, handling conditioning information, and achieving efficient training will benefit future research and advance the field. Code and models are available at https://github.com/RERV/VDT.
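To make the modular attention and token-space conditioning concrete, the following is a minimal NumPy sketch, not the released VDT implementation. It assumes video tokens of shape (batch, frames, patches, dim), applies temporal attention before spatial attention, and concatenates conditioning tokens along the frame axis; the ordering, the concatenation axis, and all function names are illustrative assumptions. Learned projections, normalization, MLP layers, and the diffusion process itself are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head self-attention with no learned projections (illustrative only).
    # x: (batch, seq, dim)
    d = x.shape[-1]
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

def temporal_attention(tokens):
    # Attend across frames, independently for each spatial patch.
    b, t, n, d = tokens.shape
    x = tokens.transpose(0, 2, 1, 3).reshape(b * n, t, d)
    x = self_attention(x)
    return x.reshape(b, n, t, d).transpose(0, 2, 1, 3)

def spatial_attention(tokens):
    # Attend across patches, independently for each frame.
    b, t, n, d = tokens.shape
    x = tokens.reshape(b * t, n, d)
    x = self_attention(x)
    return x.reshape(b, t, n, d)

def block(tokens):
    # Assumed ordering: temporal then spatial, each with a residual connection.
    tokens = tokens + temporal_attention(tokens)
    tokens = tokens + spatial_attention(tokens)
    return tokens

def concat_condition(cond_tokens, noisy_tokens):
    # Assumption: conditioning frames are prepended along the frame axis.
    return np.concatenate([cond_tokens, noisy_tokens], axis=1)

rng = np.random.default_rng(0)
cond = rng.standard_normal((2, 3, 16, 8))   # 3 observed frames
noisy = rng.standard_normal((2, 5, 16, 8))  # 5 frames to be generated
out = block(concat_condition(cond, noisy))
print(out.shape)  # (2, 8, 16, 8)
```

Because the two attention steps are separate functions acting on different axes, either one could in principle be trained or frozen on its own, which is the kind of decoupling the abstract refers to. With zero conditioning frames the same block handles unconditional generation, and with a nonzero number it handles prediction.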