DisenStudio: Customized Multi-subject Text-to-Video Generation with Disentangled Spatial Control

Generating customized content in videos has received increasing attention recently. However, existing works primarily focus on customized text-to-video generation for a single subject and suffer from subject-missing and attribute-binding problems when the video is expected to contain multiple subjects. Furthermore, existing models struggle to assign the desired actions to the corresponding subjects (the action-binding problem) and therefore fail to achieve satisfactory multi-subject generation performance. To tackle these problems, we propose DisenStudio, a novel framework that generates text-guided videos for multiple customized subjects, given a few images of each subject. Specifically, DisenStudio enhances a pretrained diffusion-based text-to-video model with our proposed spatial-disentangled cross-attention mechanism, which associates each subject with its desired action. The model is then customized for the multiple subjects with the proposed motion-preserved disentangled finetuning, which involves three tuning strategies: multi-subject co-occurrence tuning, masked single-subject tuning, and multi-subject motion-preserved tuning. The first two strategies guarantee that the subjects appear and preserve their visual attributes, and the third strategy helps the model maintain its temporal motion-generation ability when it is finetuned on static images. We conduct extensive experiments to demonstrate that DisenStudio significantly outperforms existing methods on various metrics. Additionally, we show that DisenStudio can be used as a powerful tool for various controllable generation applications.
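
The abstract does not spell out how the spatial-disentangled cross-attention is computed. As a rough illustration only, the sketch below assumes one plausible reading: each subject's prompt tokens may only be attended to by latent positions inside a region assigned to that subject, while shared tokens stay visible everywhere. The function name `region_masked_cross_attention`, its arguments, and the region layout are hypothetical and are not taken from the paper.

```python
import numpy as np

def region_masked_cross_attention(queries, keys, values, token_groups, region_masks):
    """Cross-attention in which each subject's tokens are visible only inside its region.

    queries:      (P, d) array, one row per spatial position (P = H * W).
    keys, values: (T, d) arrays, one row per text token.
    token_groups: list of token-index lists, one list per subject (assumed layout).
    region_masks: (S, P) boolean array, True where subject s is allowed to appear.
    """
    num_positions, dim = queries.shape
    num_tokens = keys.shape[0]

    logits = queries @ keys.T / np.sqrt(dim)            # (P, T) scaled dot products
    allowed = np.ones((num_positions, num_tokens), dtype=bool)
    for s, token_ids in enumerate(token_groups):
        outside = np.flatnonzero(~region_masks[s])      # positions outside region s
        allowed[np.ix_(outside, token_ids)] = False     # hide subject s's tokens there

    if not allowed.any(axis=1).all():
        raise ValueError("every position needs at least one visible token")

    logits = np.where(allowed, logits, -np.inf)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, weights


# Toy example: a 4x4 latent grid, 5 tokens, 2 subjects.
# Token 0 is shared; tokens 1-2 describe subject 0, tokens 3-4 describe subject 1.
H, W, T, D = 4, 4, 5, 8
rng = np.random.default_rng(0)
q = rng.normal(size=(H * W, D))
k = rng.normal(size=(T, D))
v = rng.normal(size=(T, D))

columns = np.arange(H * W) % W
masks = np.stack([columns < W // 2, columns >= W // 2])  # left half, right half
out, attn = region_masked_cross_attention(q, k, v, [[1, 2], [3, 4]], masks)
```

In this toy setup, positions in the right half of the grid place zero attention weight on the tokens of the left-half subject, so an action word grouped with one subject cannot influence the other subject's region. Whether DisenStudio uses hard masks of this kind, or some softer form of spatial control, is not stated in the abstract.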
