DiTCtrl: Exploring Attention Control in Multi-Modal Diffusion Transformer for Tuning-Free Multi-Prompt Longer Video Generation

From arXiv; 19 pages, 19 figures. Project page: https://onevfall.github.io/project_page/ditctrl ; GitHub repository: https://github.com/TencentARC/DiTCtrl

Sora-like video generation models have achieved remarkable progress with the Multi-Modal Diffusion Transformer (MM-DiT) architecture. However, current video generation models predominantly focus on single-prompt generation and struggle to generate coherent scenes from multiple sequential prompts, which better reflect real-world dynamic scenarios. While some pioneering works have explored multi-prompt video generation, they face significant challenges, including strict training data requirements, weak prompt following, and unnatural transitions. To address these problems, we propose DiTCtrl, the first training-free multi-prompt video generation method for MM-DiT architectures. Our key idea is to treat multi-prompt video generation as temporal video editing with smooth transitions. To achieve this goal, we first analyze MM-DiT's attention mechanism and find that its 3D full attention behaves similarly to the cross-attention and self-attention blocks in UNet-like diffusion models, which enables mask-guided, precise semantic control across different prompts, with attention sharing, for multi-prompt video generation. Building on this design, videos generated by DiTCtrl achieve smooth transitions and consistent object motion given multiple sequential prompts, without additional training. We also present MPVBench, a new benchmark specifically designed to evaluate multi-prompt video generation. Extensive experiments demonstrate that our method achieves state-of-the-art performance without additional training.
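To make the attention observation concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a joint attention layer runs over the concatenation of text tokens and video tokens, so that the resulting attention map can be sliced into four regions. The function names (`joint_attention_weights`, `split_blocks`), the block labels, and the toy sizes are hypothetical, and learned query/key projections and multiple heads are omitted for brevity.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention_weights(text_tokens, video_tokens):
    """Attention weights over the concatenated [text; video] sequence.

    Simplification (assumption): tokens are used directly as queries and
    keys, with no learned projections and a single head.
    """
    tokens = text_tokens + video_tokens
    scale = 1.0 / math.sqrt(len(tokens[0]))
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    return weights


def split_blocks(weights, n_text):
    """Slice the joint map into four regions (rows = queries, cols = keys)."""
    return {
        "text_to_text": [row[:n_text] for row in weights[:n_text]],
        "text_to_video": [row[n_text:] for row in weights[:n_text]],
        # Video queries over text keys: the cross-attention-like region.
        "video_to_text": [row[:n_text] for row in weights[n_text:]],
        # Video queries over video keys: the self-attention-like region.
        "video_to_video": [row[n_text:] for row in weights[n_text:]],
    }


random.seed(0)
dim, n_text, n_video = 4, 3, 8  # toy sizes, e.g. 2 frames x 4 patches
text = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_text)]
video = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_video)]

weights = joint_attention_weights(text, video)
blocks = split_blocks(weights, n_text)
for name, block in blocks.items():
    print(name, len(block), "x", len(block[0]))
```

Because the softmax is taken over the full concatenated sequence, each sliced region on its own is not normalized; only a full row sums to one. In a UNet-like model the cross-attention and self-attention maps come from separate layers, whereas here both roles are read off a single joint map.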
