Image customization has been extensively studied in text-to-image (T2I) diffusion models, leading to impressive outcomes and applications. With the emergence of text-to-video (T2V) diffusion models, its temporal counterpart, motion customization, has not yet been well investigated. To address the challenge of one-shot motion customization, we propose Customize-A-Video, which models the motion from a single reference video and adapts it to new subjects and scenes with both spatial and temporal variations. It applies low-rank adaptation (LoRA) to the temporal attention layers of a pre-trained T2V diffusion model to tailor it to the specific motion in the reference video. To disentangle spatial and temporal information during training, we introduce the novel concept of appearance absorbers, which detach the original appearance from the single reference video prior to motion learning. Our method readily extends to various downstream tasks, including custom video generation and editing, video appearance customization, and multiple motion combination, in a plug-and-play fashion. Our project page can be found at https://anonymous-314.github.io.
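To make the core mechanism concrete, the following is a minimal numpy sketch of low-rank adaptation (LoRA) applied to a generic frozen linear layer, such as a projection inside a temporal attention block. This is an illustrative sketch of the general LoRA technique under common conventions (zero-initialized up-projection, `alpha/rank` scaling), not the authors' implementation; all names and shapes are assumptions.

```python
import numpy as np

class LoRALinear:
    """Conceptual LoRA sketch: a frozen weight W plus a trainable
    low-rank residual (alpha/rank) * B @ A. Only A and B would be
    updated during motion customization; W stays frozen."""

    def __init__(self, weight, rank=4, alpha=4.0):
        d_out, d_in = weight.shape
        self.W = weight                               # frozen pretrained weight
        self.A = np.random.randn(rank, d_in) * 0.01   # trainable down-projection
        self.B = np.zeros((d_out, rank))              # trainable up-projection, zero-init
        self.scale = alpha / rank

    def __call__(self, x):
        # y = x W^T + scale * (x A^T) B^T
        # With B zero-initialized, the LoRA branch contributes nothing at
        # the start of training, so the pre-trained behavior is preserved.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

# At initialization the adapted layer matches the frozen layer exactly.
W = np.random.randn(8, 16)
layer = LoRALinear(W, rank=2)
x = np.random.randn(3, 16)
assert np.allclose(layer(x), x @ W.T)
```

Because the residual factors `A` and `B` are small relative to `W`, such modules can be stored and swapped independently, which is what enables the plug-and-play composition of motions and appearances described above.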