Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches, designed for single subjects, struggle to handle multiple subjects, which is a more challenging and practical scenario. In this work, we aim to advance multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that generates identity-preserving videos under the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Then, building on a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific object area, we segment the object from the given reference images and provide a corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark, with 69 individual subjects and 57 meaningful pairs. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method over previous state-of-the-art approaches.
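As an illustration only, the step of composing several subjects in a single image, together with one object mask per subject, can be sketched as follows. Side-by-side concatenation, the function name `compose_side_by_side`, and the array layout are assumptions made for this sketch; the abstract does not specify how the subjects are composed or how the masks enter attention learning.

```python
import numpy as np


def compose_side_by_side(images, masks):
    """Hypothetical helper: place same-height reference images (H, W, 3) next
    to each other and return one full-canvas boolean mask per subject."""
    if not images or len(images) != len(masks):
        raise ValueError("need at least one image and exactly one mask per image")
    height = images[0].shape[0]
    for img, mask in zip(images, masks):
        if img.shape[0] != height or mask.shape != img.shape[:2]:
            raise ValueError("images must share a height and match their masks")

    # Assumption: subjects co-occur by simple concatenation along the width.
    composed = np.concatenate(images, axis=1)
    total_width = composed.shape[1]

    subject_masks = []
    offset = 0
    for img, mask in zip(images, masks):
        width = img.shape[1]
        # Each subject's mask is zero outside the slot occupied by its image.
        canvas = np.zeros((height, total_width), dtype=bool)
        canvas[:, offset:offset + width] = mask.astype(bool)
        subject_masks.append(canvas)
        offset += width
    return composed, subject_masks
```

Keeping a separate mask per subject means the regions never overlap on the composed canvas, which is the property a mask-guided attention objective would rely on to tell the subjects apart.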