Customized generation using diffusion models has made impressive progress in image generation, but remains unsatisfactory for the more challenging task of video generation, which requires control over both subject and motion. To that end, we present DreamVideo, a novel approach to generating personalized videos from a few static images of the desired subject and a few videos of the target motion. DreamVideo decouples this task into two stages, subject learning and motion learning, by leveraging a pre-trained video diffusion model. Subject learning aims to accurately capture the fine appearance of the subject from the provided images, which is achieved by combining textual inversion with fine-tuning of our carefully designed identity adapter. In motion learning, we design a motion adapter and fine-tune it on the given videos to effectively model the target motion pattern. Combining these two lightweight and efficient adapters allows for flexible customization of any subject with any motion. Extensive experimental results demonstrate that DreamVideo outperforms state-of-the-art methods for customized video generation. Our project page is at https://dreamvideo-t2v.github.io.
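To make the idea of combining two separately trained adapters concrete, the following is a minimal sketch in plain Python. It is not DreamVideo's implementation: the abstract does not specify the adapters' internals or where they attach to the video diffusion model. The residual bottleneck form, the ReLU, the zero-initialized up-projection, and the names `BottleneckAdapter`, `frozen_block`, and `customized_forward` are all assumptions made for illustration.

```python
import random


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


class BottleneckAdapter:
    """Hypothetical residual adapter: x + up(relu(down(x))).

    This is a generic adapter form assumed for illustration, not the
    identity or motion adapter design from the paper.
    """

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        # Small random down-projection from dim to hidden.
        self.down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                     for _ in range(hidden)]
        # Zero-initialized up-projection (an assumption): an untrained
        # adapter then leaves the frozen model's features unchanged.
        self.up = [[0.0] * hidden for _ in range(dim)]

    def __call__(self, x):
        h = [max(0.0, v) for v in matvec(self.down, x)]
        delta = matvec(self.up, h)
        return [xi + di for xi, di in zip(x, delta)]


def frozen_block(x):
    """Stand-in for one frozen layer of a pre-trained model."""
    return [2.0 * v for v in x]


def customized_forward(x, identity_adapter=None, motion_adapter=None):
    """Run the frozen layer, then apply whichever adapters are supplied.

    Each adapter could be trained on its own (one on subject images, one
    on motion videos) and the two plugged in together at inference time.
    """
    h = frozen_block(x)
    if identity_adapter is not None:
        h = identity_adapter(h)
    if motion_adapter is not None:
        h = motion_adapter(h)
    return h


if __name__ == "__main__":
    subject = BottleneckAdapter(dim=4, hidden=2, seed=1)
    motion = BottleneckAdapter(dim=4, hidden=2, seed=2)
    print(customized_forward([1.0, -2.0, 0.5, 3.0], subject, motion))
```

Because the base layer stays frozen and each adapter only adds a residual, either adapter can be swapped independently. This is one plausible reading of how "any subject with any motion" could be composed, under the assumptions stated above.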