In this work, we present MotionBooth, an innovative framework designed for animating customized subjects with precise control over both object and camera movements. Given a few images of a specific object, we efficiently fine-tune a text-to-video model to accurately capture the object's shape and attributes. Our approach introduces a subject region loss and a video preservation loss to strengthen subject learning, along with a subject token cross-attention loss to align the customized subject with motion control signals. Additionally, we propose training-free techniques for controlling subject and camera motion during inference. In particular, we manipulate cross-attention maps to govern subject motion and introduce a novel latent shift module for camera movement control. MotionBooth preserves the subject's appearance while simultaneously controlling motion in the generated videos. Extensive quantitative and qualitative evaluations demonstrate the superiority and effectiveness of our method. Our project page is at https://jianzongwu.github.io/projects/motionbooth
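The latent shift idea mentioned above can be illustrated with a minimal sketch: to emulate camera panning without extra training, each frame's latent feature map is translated spatially by a per-frame offset. The function below is a hypothetical, simplified illustration using NumPy (the actual MotionBooth module operates inside the diffusion denoising loop and may handle boundaries differently); `latent_shift`, its signature, and the circular-shift boundary handling are assumptions for demonstration only.

```python
import numpy as np

def latent_shift(latents, dx, dy):
    """Shift per-frame latent maps to mimic camera panning.

    latents: array of shape (F, C, H, W) -- one latent map per video frame.
    dx, dy: per-frame integer offsets (in latent-grid units) along
            width and height; positive values pan right/down.
    Boundary pixels wrap around here (np.roll); a real implementation
    might instead pad or re-noise the exposed region.
    """
    out = np.empty_like(latents)
    for f in range(latents.shape[0]):
        # Roll height (axis -2) by dy and width (axis -1) by dx
        out[f] = np.roll(latents[f], shift=(dy[f], dx[f]), axis=(-2, -1))
    return out

# Example: 2 frames, 1 channel, 4x4 latent grid; pan frame 0 right by 1
lat = np.arange(2 * 1 * 4 * 4, dtype=float).reshape(2, 1, 4, 4)
shifted = latent_shift(lat, dx=[1, 0], dy=[0, 0])
```

A linearly increasing offset across frames would produce a steady pan; combining nonzero `dx` and `dy` sketches a diagonal camera move.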