CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Customized text-to-video generation aims to generate high-quality videos guided by text prompts and subject references. Current approaches for personalizing text-to-video generation struggle to handle multiple subjects, which is a more challenging and practical scenario. In this work, we aim to advance multi-subject guided text-to-video customization. We propose CustomVideo, a novel framework that generates identity-preserving videos under the guidance of multiple subjects. Specifically, we first encourage the co-occurrence of multiple subjects by composing them in a single image. Then, on top of a base text-to-video diffusion model, we design a simple yet effective attention control strategy to disentangle different subjects in the latent space of the diffusion model. Moreover, to help the model focus on the specific area of each object, we segment the object from the given reference images and provide a corresponding object mask for attention learning. We also collect a multi-subject text-to-video generation dataset as a comprehensive benchmark. Extensive qualitative, quantitative, and user study results demonstrate the superiority of our method over previous state-of-the-art approaches. The project page is https://kyfafyd.wang/projects/customvideo.
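
The two ideas named in the abstract, composing several subjects in a single image and using object masks to guide attention, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not specify the composition layout or the form of the attention objective, so the side-by-side layout, the mean-squared-error objective, and the function names `compose_side_by_side` and `attention_mask_loss` are all assumptions made for illustration.

```python
import numpy as np

def compose_side_by_side(images, masks):
    """Place equally sized subject images (H, W, 3) next to each other.

    Hypothetical helper: side-by-side concatenation is an assumed layout.
    Returns the composed canvas and one full-canvas mask per subject.
    """
    canvas = np.concatenate(images, axis=1)
    h, w = masks[0].shape
    placed = []
    for i, m in enumerate(masks):
        # Zero everywhere except the slot this subject occupies on the canvas.
        full = np.zeros((h, w * len(masks)), dtype=np.float32)
        full[:, i * w:(i + 1) * w] = m
        placed.append(full)
    return canvas, placed

def attention_mask_loss(attn, mask):
    """Assumed objective: MSE between a max-normalised attention map and a mask."""
    attn = attn / (attn.max() + 1e-8)
    return float(np.mean((attn - mask) ** 2))

# Two toy 4x4 "subjects", each with a binary object mask.
imgs = [np.zeros((4, 4, 3), dtype=np.float32), np.ones((4, 4, 3), dtype=np.float32)]
msks = [np.eye(4, dtype=np.float32), np.ones((4, 4), dtype=np.float32)]
canvas, placed = compose_side_by_side(imgs, msks)

# An attention map that matches subject 0's mask scores lower against that
# mask than against subject 1's mask.
loss_same = attention_mask_loss(placed[0], placed[0])
loss_other = attention_mask_loss(placed[0], placed[1])
```

Under these assumptions, a lower loss for a subject's own mask than for another subject's mask is the behavior that would push each subject's attention toward its own region of the composed image.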

