Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.