Video personalization methods allow us to synthesize videos with specific concepts such as people, pets, and places. However, existing methods often focus on limited domains, require time-consuming optimization per subject, or support only a single subject. We present Video Alchemist, a video model with built-in multi-subject, open-set personalization capabilities for both foreground objects and background, eliminating the need for time-consuming test-time optimization. Our model is built on a new Diffusion Transformer module that fuses each conditional reference image and its corresponding subject-level text prompt with cross-attention layers. Developing such a large model presents two main challenges: dataset and evaluation. First, as paired datasets of reference images and videos are extremely hard to collect, we sample selected video frames as reference images and synthesize a clip of the target video. However, while models can easily denoise training videos given reference frames, they fail to generalize to new contexts. To mitigate this issue, we design a new automatic data construction pipeline with extensive image augmentations. Second, evaluating open-set video personalization is a challenge in itself. To address this, we introduce a personalization benchmark that focuses on accurate subject fidelity and supports diverse personalization scenarios. Finally, our extensive experiments show that our method significantly outperforms existing personalization methods in both quantitative and qualitative evaluations.