Current video generative foundation models primarily focus on text-to-video tasks, providing limited control for fine-grained video content creation. Although adapter-based approaches (e.g., ControlNet) enable additional controls with minimal fine-tuning, they encounter challenges when integrating multiple conditions, including: branch conflicts between independently trained adapters, parameter redundancy leading to increased computational cost, and suboptimal performance compared to full fine-tuning. To address these challenges, we introduce FullDiT, a unified foundation model for video generation that seamlessly integrates multiple conditions via unified full-attention mechanisms. By fusing multi-task conditions into a unified sequence representation and leveraging the long-context learning ability of full self-attention to capture condition dynamics, FullDiT reduces parameter overhead, avoids condition conflicts, and demonstrates scalability and emergent abilities. We further introduce FullBench for multi-task video generation evaluation. Experiments demonstrate that FullDiT achieves state-of-the-art results, highlighting the efficacy of full-attention in complex multi-task video generation.
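To make the core idea concrete, the sketch below illustrates (in PyTorch) how condition tokens and video latent tokens can be fused into a single unified sequence and processed by a standard full self-attention block. It is a minimal illustration of the concatenate-and-attend pattern described above, not the authors' implementation; the module name, token counts, and dimensions are assumptions chosen for readability.

```python
# Minimal sketch (not the authors' code) of the unified full-attention idea:
# condition tokens (e.g., camera, identity, depth) are concatenated with the
# noisy video latent tokens into one sequence, and a transformer block with
# full self-attention jointly attends over all of them.
# Token counts, dimensions, and names below are illustrative assumptions.

import torch
import torch.nn as nn

class UnifiedFullAttentionBlock(nn.Module):
    def __init__(self, dim: int = 1024, num_heads: int = 16):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Full (unmasked) self-attention over the whole sequence, so video
        # tokens and every condition token can interact directly.
        h = self.norm1(tokens)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        tokens = tokens + attn_out
        return tokens + self.mlp(self.norm2(tokens))

# Hypothetical token streams for one sample (batch = 1, dim = 1024).
video_tokens  = torch.randn(1, 4096, 1024)  # noisy video latent tokens
camera_tokens = torch.randn(1,  128, 1024)  # camera-trajectory condition
id_tokens     = torch.randn(1,   64, 1024)  # identity condition
depth_tokens  = torch.randn(1, 4096, 1024)  # depth-map condition

# Fuse all conditions and the video latents into a single unified sequence.
unified = torch.cat([camera_tokens, id_tokens, depth_tokens, video_tokens], dim=1)
out = UnifiedFullAttentionBlock()(unified)  # shape: (1, 8384, 1024)
```

Because all conditions live in one sequence, no per-condition adapter branches are needed; whether this matches FullDiT's exact tokenization and block layout would depend on details beyond the abstract.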