Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation struggle with multiple subjects, which is a more challenging and practical scenario. In this work, we aim to promote multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that generates identity-preserving videos under the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them into a single image. Then, building on a basic text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on specific object regions, we segment the objects from the given reference images and provide corresponding object masks for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method over previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
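To make the mask-guided attention learning concrete, below is a minimal PyTorch sketch of one plausible form of this regularization: each subject's text token is encouraged to attend only within that subject's segmentation mask, penalizing attention mass that leaks outside it. The function name, tensor layout, and loss form are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def mask_guided_attention_loss(attn_maps, subject_token_ids, subject_masks):
    """One plausible mask-guided attention regularizer (illustrative sketch).

    attn_maps:         (B, heads, H*W, T) cross-attention probabilities
                       between latent spatial positions and text tokens.
    subject_token_ids: list of token indices, one per subject.
    subject_masks:     (B, S, H, W) binary masks, one per subject,
                       resized to the latent resolution.
    Returns a scalar loss encouraging each subject token's attention
    to concentrate inside that subject's mask.
    """
    B, heads, HW, T = attn_maps.shape
    loss = attn_maps.new_zeros(())
    for s, tok in enumerate(subject_token_ids):
        # Attention of every spatial position to this subject's token,
        # averaged over heads: shape (B, H*W).
        a = attn_maps[:, :, :, tok].mean(dim=1)
        m = subject_masks[:, s].reshape(B, HW)
        # Normalize attention into a spatial distribution, then penalize
        # the probability mass falling outside the subject's mask.
        a = a / (a.sum(dim=-1, keepdim=True) + 1e-8)
        loss = loss + ((1.0 - m) * a).sum(dim=-1).mean()
    return loss / len(subject_token_ids)
```

Suppressing attention outside each subject's mask pulls the token-to-region correspondences apart, which is one way to realize the disentanglement of different subjects in the diffusion model's latent space described above.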