The emergence of diffusion models has greatly advanced image and video generation. Recent work on controllable video generation spans text-to-video generation and video motion control, within which camera motion control is an important topic. However, existing camera motion control methods rely on training a temporal camera module and require substantial computational resources because of the large number of parameters in video generation models. Moreover, existing methods predefine camera motion types during training, which limits their flexibility in camera control. Therefore, to reduce training costs and achieve flexible camera control, we propose COMD, a novel training-free video motion transfer model that disentangles camera motions and object motions in source videos and transfers the extracted camera motions to new videos. We first propose a one-shot camera motion disentanglement method that extracts camera motion from a single source video: it separates the moving objects from the background and estimates the camera motion in the moving-object region from the motion in the background by solving a Poisson equation. Furthermore, we propose a few-shot camera motion disentanglement method that extracts the common camera motion from multiple videos with similar camera motions, using a window-based clustering technique to extract the common features in the temporal attention maps of those videos. Finally, we propose a motion combination method that combines different types of camera motions, giving our model more controllable and flexible camera control. Extensive experiments demonstrate that our training-free approach can effectively decouple camera and object motion and apply the decoupled camera motion to a wide range of controllable video generation tasks, achieving flexible and diverse camera motion control.
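To make the Poisson-equation step of the one-shot method more concrete, the following is a minimal sketch under stated assumptions, not the implementation used in COMD. It assumes a single 2D scalar field (for example, one channel of a motion-related map), a boolean mask marking the moving-object region, a zero source term (so the Poisson equation reduces to the Laplace equation), and plain Jacobi iteration as the solver. The function name `fill_masked_region` and its parameters are hypothetical. Background values act as boundary conditions, and the values inside the mask are filled in so that they vary smoothly with the surrounding background.

```python
import numpy as np


def fill_masked_region(field, mask, iters=2000):
    """Fill `field` inside `mask` from the surrounding background values.

    Solves the discrete Laplace equation (a Poisson equation with zero
    source term, an assumption of this sketch) inside the masked region,
    using the unmasked pixels as fixed boundary values.
    """
    field = np.asarray(field, dtype=np.float64)
    mask = np.asarray(mask, dtype=bool)
    if mask.all():
        raise ValueError("mask covers the whole field; no background to use")

    out = field.copy()
    # Initial guess inside the mask: the mean of the background values.
    out[mask] = field[~mask].mean()

    for _ in range(iters):
        # Edge padding keeps the 4-neighbour average defined at the border.
        padded = np.pad(out, 1, mode="edge")
        neighbour_avg = 0.25 * (
            padded[:-2, 1:-1] + padded[2:, 1:-1]
            + padded[1:-1, :-2] + padded[1:-1, 2:]
        )
        # Jacobi update: only masked pixels change; background stays fixed.
        out[mask] = neighbour_avg[mask]
    return out
```

As a sanity check on the design, a linear ramp is a harmonic function, so if the background follows a ramp and the masked region is corrupted by unrelated values, the fill should recover the ramp inside the mask. Jacobi iteration is chosen here only for brevity; a sparse direct solver would converge far faster on large regions.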