We propose a subject-driven, customized video generation model trained by decoupling subject-specific learning from temporal dynamics, enabling zero-shot customization without additional tuning. Traditional tuning-free video customization methods often rely on large annotated video datasets, which are computationally expensive to collect and require extensive labeling. In contrast, we train video customization models directly on an image customization dataset, factorizing video customization into two parts: (1) identity injection via the image customization dataset, and (2) preservation of temporal modeling with a small set of unannotated videos through image-to-video training. In addition, we apply random image token dropping with randomized image initialization during image-to-video fine-tuning to mitigate the copy-and-paste problem. To further improve learning, we introduce stochastic switching during the joint optimization of subject-specific and temporal features, mitigating catastrophic forgetting. Our method achieves strong subject consistency and scalability, outperforming existing zero-shot video customization models and demonstrating the effectiveness of the proposed framework.
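As a rough illustration of the random image token dropping described above, the following sketch (hypothetical; the function name, drop probability, and token representation are assumptions, not the paper's implementation) randomly discards a fraction of the reference-image conditioning tokens during image-to-video fine-tuning, so the model cannot simply copy the reference image into every frame:

```python
import random

def drop_image_tokens(tokens, drop_prob=0.3, seed=None):
    """Randomly drop reference-image conditioning tokens.

    Hypothetical sketch: dropping a fraction of the image tokens during
    image-to-video fine-tuning weakens pixel-level copying and pushes
    the model to synthesize motion, mitigating the copy-and-paste issue.
    """
    rng = random.Random(seed)
    kept = [t for t in tokens if rng.random() >= drop_prob]
    # Always keep at least one token so the conditioning is never empty.
    return kept if kept else [tokens[0]]

# Example: drop roughly half of 16 image tokens, reproducibly.
tokens = list(range(16))
kept = drop_image_tokens(tokens, drop_prob=0.5, seed=0)
```

In practice the surviving tokens (or randomly initialized replacements, per the randomized image initialization mentioned above) would be fed to the video model in place of the full image token sequence.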
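The stochastic switching between the two objectives can be sketched as follows (a minimal illustration; the scheduler name, switching probability, and data-source labels are assumptions, not the paper's implementation):

```python
import random

def make_batch_scheduler(p_image=0.5, seed=None):
    """Stochastic switching between the two training objectives.

    Hypothetical sketch: at each optimization step, randomly choose
    whether the batch comes from the image customization data
    (identity injection) or from the unannotated videos (temporal
    preservation). Interleaving the two objectives, rather than
    training them sequentially, mitigates catastrophic forgetting.
    """
    rng = random.Random(seed)

    def next_source():
        return "image" if rng.random() < p_image else "video"

    return next_source

# Example: sample an 8-step training schedule, reproducibly.
next_source = make_batch_scheduler(p_image=0.5, seed=0)
schedule = [next_source() for _ in range(8)]
```

Each step would then draw a batch from the selected source and apply the corresponding loss, keeping both subject-specific and temporal features fresh throughout joint optimization.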