Diffusion models have demonstrated great success in text-to-video (T2V) generation. However, existing methods still struggle with complex (long) video generation scenarios that involve multiple objects or dynamic changes in object numbers. To address these limitations, we propose VideoTetris, a novel framework that enables compositional T2V generation. Specifically, we propose spatio-temporal compositional diffusion, which precisely follows complex textual semantics by manipulating and composing the attention maps of the denoising network spatially and temporally. Moreover, we introduce an enhanced video data preprocessing pipeline that improves the training data in terms of motion dynamics and prompt understanding, together with a new reference frame attention mechanism that improves the consistency of auto-regressive video generation. Extensive experiments demonstrate that VideoTetris achieves impressive qualitative and quantitative results in compositional T2V generation. Code is available at: https://github.com/YangLing0818/VideoTetris
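To make the idea of spatial composition of attention concrete, below is a minimal sketch (not the authors' implementation) of how cross-attention outputs from several sub-prompts could be blended according to spatial region masks. The function name `compose_cross_attention`, the mask layout, and the overlap-averaging rule are illustrative assumptions only.

```python
# Hypothetical sketch of region-wise composition of cross-attention outputs.
# Assumes flattened spatial tokens (N = H*W) and one (k, v) pair per sub-prompt.
import torch


def compose_cross_attention(q, sub_kv, region_masks, scale=None):
    """Blend cross-attention outputs from several sub-prompts by spatial region.

    q:            (B, N, C) image-token queries, N = H*W spatial tokens
    sub_kv:       list of (k, v) pairs, one per sub-prompt, each of shape (B, L_i, C)
    region_masks: list of (N,) binary masks assigning spatial tokens to sub-prompts
    """
    B, N, C = q.shape
    scale = scale or C ** -0.5
    out = torch.zeros_like(q)
    weight = torch.zeros(B, N, 1, device=q.device, dtype=q.dtype)
    for (k, v), mask in zip(sub_kv, region_masks):
        attn = torch.softmax(q @ k.transpose(-2, -1) * scale, dim=-1)  # (B, N, L_i)
        region_out = attn @ v                                          # (B, N, C)
        m = mask.view(1, N, 1).to(q.dtype)
        out = out + region_out * m          # keep this sub-prompt's output inside its region
        weight = weight + m
    # average where regions overlap; tokens covered by no region stay zero
    return out / weight.clamp(min=1.0)


# Example usage with two sub-prompts, each controlling half of the spatial tokens.
B, N, C, L = 1, 64, 320, 8
q = torch.randn(B, N, C)
kvs = [(torch.randn(B, L, C), torch.randn(B, L, C)) for _ in range(2)]
masks = [torch.zeros(N), torch.zeros(N)]
masks[0][: N // 2] = 1  # first half of tokens follow sub-prompt 1
masks[1][N // 2:] = 1   # second half follow sub-prompt 2
out = compose_cross_attention(q, kvs, masks)  # (1, 64, 320)
```

Extending this per-frame composition with time-varying masks would give a rough analogue of the temporal dimension of the composition described above; the paper and repository should be consulted for the actual mechanism.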