Large-scale pre-trained diffusion models have exhibited remarkable capabilities in diverse video generation. Given a set of video clips of the same motion concept, the task of Motion Customization is to adapt existing text-to-video diffusion models to generate videos with this motion. Examples include generating a video of a car moving in a prescribed manner under specific camera movements to make a movie, or a video illustrating how a bear would lift weights to inspire creators. Adaptation methods have been developed for customizing appearance, such as subject or style, but remain unexplored for motion. It is straightforward to extend mainstream adaptation methods to motion customization, including full model tuning, parameter-efficient tuning of additional layers, and Low-Rank Adaptation (LoRA). However, the motion concept learned by these methods is often coupled with the limited appearances in the training videos, making it difficult to generalize the customized motion to other appearances. To overcome this challenge, we propose MotionDirector, which uses a dual-path LoRA architecture to decouple the learning of appearance and motion. Further, we design a novel appearance-debiased temporal loss to mitigate the influence of appearance on the temporal training objective. Experimental results show that the proposed method can generate videos of diverse appearances for the customized motions. Our method also supports various downstream applications, such as mixing the appearance and motion of different videos, and animating a single image with customized motions. Our code and model weights will be released.
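
For readers unfamiliar with Low-Rank Adaptation, the following is a minimal pure-Python sketch of the generic idea only: a frozen weight matrix W is augmented with a trainable low-rank product B·A, so far fewer parameters are tuned than in full model tuning. It does not reproduce MotionDirector's dual-path architecture or its appearance-debiased temporal loss. The class name `LoRALinear`, the helper functions, and the `rank`, `alpha`, and `seed` parameters are illustrative assumptions and are not taken from the paper or its code release.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two matrices of the same shape."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(a, s):
    """Multiply every entry of a matrix by the scalar s."""
    return [[x * s for x in row] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen W (d_out x d_in) plus low-rank update B @ A."""

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pre-trained weight, kept frozen
        # A gets a small random init; B starts at zero, so the update B @ A
        # is zero and the adapted layer initially matches the frozen layer.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scaling = alpha / rank

    def effective_weight(self):
        """Return W + (alpha / rank) * B @ A."""
        return add(self.weight, scale(matmul(self.B, self.A), self.scaling))

    def forward(self, x):
        """Apply the adapted weight to an input vector x of length d_in."""
        return [sum(w * v for w, v in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the entries of A and B, the only parameters that would be tuned."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


W = [
    [1.0, 2.0, 3.0, 4.0],
    [0.0, 1.0, 0.0, 1.0],
    [2.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
layer = LoRALinear(W, rank=1)
frozen_output = layer.forward([1.0, 1.0, 1.0, 1.0])
```

In this toy setting the 4 x 4 weight has 16 entries, while the rank-1 adapter tunes only 8. One plausible reading of a dual-path design is that separate adapters of this kind are attached to different parts of the network, but how the paper actually assigns them to appearance and motion is not specified in this abstract.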