This work introduces the Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representation inherent in transformers. We also propose a unified spatial-temporal mask modeling mechanism, seamlessly integrated with the model, to cater to diverse video generation scenarios. VDT offers several appealing benefits. 1) It excels at capturing temporal dependencies to produce temporally consistent video frames and can even simulate the physics and dynamics of 3D objects over time. 2) It supports flexible conditioning, e.g., simple concatenation in the token space, effectively unifying different token lengths and modalities. 3) Paired with our proposed spatial-temporal mask modeling mechanism, it becomes a general-purpose video diffuser that handles a range of tasks, including unconditional generation, video prediction, interpolation, animation, and completion. Extensive experiments on these tasks across various scenarios, including autonomous driving, natural weather, human action, and physics-based simulation, demonstrate the effectiveness of VDT. Additionally, we present comprehensive studies on how VDT handles conditioning information with the mask modeling mechanism, which we believe will benefit future research and advance the field. Project page: https://VDT-2023.github.io
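To illustrate how a single mask could unify several of the tasks listed above, the following is a minimal sketch rather than the method itself. The abstract does not specify the mask layout or the blending rule, so the function names (`frame_mask`, `blend`), the per-frame binary mask, the choice of which frames are conditioned for each task, and the use of one number per frame are all assumptions made for illustration. Video completion is left out because it would need a spatial mask within each frame.

```python
from typing import List


def frame_mask(task: str, num_frames: int, num_context: int = 2) -> List[int]:
    """Return a per-frame mask (hypothetical layout).

    1 marks a frame supplied as a condition, 0 marks a frame to be generated.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames")
    if task == "unconditional":
        # Nothing is given: every frame is generated.
        return [0] * num_frames
    if task == "prediction":
        # The first few frames are given; at least one frame is left to generate.
        k = min(num_context, num_frames - 1)
        return [1] * k + [0] * (num_frames - k)
    if task == "interpolation":
        # The first and last frames are given; the frames in between are generated.
        return [1] + [0] * (num_frames - 2) + [1]
    if task == "animation":
        # A single still image (the first frame) is given.
        return [1] + [0] * (num_frames - 1)
    raise ValueError(f"unknown task: {task}")


def blend(noisy: List[float], condition: List[float], mask: List[int]) -> List[float]:
    """Keep the condition where mask is 1 and the noisy value where it is 0.

    Each frame is reduced to a single float here purely to keep the sketch small.
    """
    if not (len(noisy) == len(condition) == len(mask)):
        raise ValueError("noisy, condition, and mask must have the same length")
    return [m * c + (1 - m) * n for n, c, m in zip(noisy, condition, mask)]
```

Under these assumptions, switching tasks only changes the mask: `blend(noisy, condition, frame_mask("prediction", 5))` keeps the first two frames from `condition` and leaves the remaining three to the diffusion process, while the same call with `"interpolation"` would fix only the two end frames.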