Creating a vivid video from an event or scenario in our imagination is a truly fascinating experience. Recent advances in text-to-video synthesis have shown the potential to achieve this with text prompts alone. While text is convenient for conveying the overall scene context, it may be insufficient for precise control. In this paper, we explore customized video generation that uses text as the context description and motion structure (e.g., frame-wise depth) as concrete guidance. Our method, dubbed Make-Your-Video, performs joint-conditional video generation using a Latent Diffusion Model that is pre-trained for still image synthesis and then extended to video generation through the introduction of temporal modules. This two-stage learning scheme not only reduces the computing resources required, but also improves performance by transferring the rich concepts available in image datasets to video generation. Moreover, we use a simple yet effective causal attention mask strategy to enable longer video synthesis, which effectively mitigates the potential quality degradation. Experimental results show the superiority of our method over existing baselines, particularly in terms of temporal coherence and fidelity to users' guidance. In addition, our model enables several intriguing applications that demonstrate potential for practical usage.
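The abstract does not spell out how the causal attention mask is applied, so the following is only a minimal sketch of the general idea, assuming attention across the frame axis in which each frame may attend to itself and to earlier frames but not to later ones. The function names `causal_mask` and `causal_temporal_attention`, the per-frame feature vectors, and the single-head, unbatched layout are all hypothetical choices for illustration and are not taken from Make-Your-Video.

```python
import math


def causal_mask(num_frames):
    """Lower-triangular mask: entry [i][j] is True when frame i may attend to frame j."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]


def causal_temporal_attention(q, k, v):
    """Single-head attention over the frame axis with a causal mask.

    q, k, v are lists of per-frame feature vectors (hypothetical layout:
    one vector per frame). Returns (outputs, weights).
    """
    num_frames = len(q)
    dim = len(q[0])
    mask = causal_mask(num_frames)
    outputs, all_weights = [], []
    for i in range(num_frames):
        # Scaled dot-product scores; future frames are set to -inf.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(dim)
            if mask[i][j]
            else float("-inf")
            for j in range(num_frames)
        ]
        # Numerically stable softmax; masked entries become exactly 0.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the value vectors of the visible frames.
        out = [
            sum(weights[j] * v[j][d] for j in range(num_frames))
            for d in range(len(v[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights
```

Under this assumption, the output for a given frame depends only on that frame and the ones before it. This is the property that generally makes causal masks attractive when a sequence is extended beyond the length seen during training.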