Edit-Your-Motion: Space-Time Diffusion Decoupling Learning for Video Motion Editing

Existing diffusion-based video editing methods have achieved impressive results in motion editing. Most of them focus on aligning the motion of the edited video with that of a reference video. However, they do not constrain the background and object content of the video to remain unchanged, so users may obtain unexpected results. In this paper, we propose Edit-Your-Motion, a one-shot video motion editing method that requires only a single text-video pair for training. Specifically, we design the Detailed Prompt-Guided Learning Strategy (DPL) to decouple spatio-temporal features in space-time diffusion models. DPL separates the learning of object content and motion into two training stages. In the first stage, we learn the spatial features (the features of the object content) and break the temporal relationships among the video frames by shuffling them. We further propose Recurrent-Causal Attention (RC-Attn) to learn consistent content features of the object from the unordered frames. In the second stage, we restore the temporal order of the video frames to learn the temporal features (the features of the background and the object's motion). We also adopt a Noise Constraint Loss to smooth out inter-frame differences. Finally, at inference, we use a two-branch structure (an editing branch and a reconstruction branch) to inject the content features of the source object into the editing branch. With Edit-Your-Motion, users can edit the motion of objects in a source video to generate more exciting and diverse videos. Comprehensive qualitative experiments, quantitative experiments, and user preference studies demonstrate that Edit-Your-Motion outperforms other methods.
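
The frame-order handling in the two training stages can be illustrated with a minimal sketch. It covers only the data-ordering step described in the abstract: frames are shuffled in stage one to break temporal relationships and kept in their original order in stage two. The function name `frames_for_stage`, the `seed` parameter, and the string placeholders standing in for frames are assumptions made for illustration; the sketch does not reproduce the model, RC-Attn, or the Noise Constraint Loss.

```python
import random


def frames_for_stage(frames, stage, seed=None):
    """Return the frame sequence to train on for a given stage.

    Hypothetical helper, not the authors' code:
    stage 1 shuffles the frames to break their temporal relationships,
    stage 2 returns them in their original temporal order.
    """
    if stage == 1:
        rng = random.Random(seed)   # local RNG so the global state is untouched
        shuffled = list(frames)     # copy, so the caller's list is not mutated
        rng.shuffle(shuffled)
        return shuffled
    if stage == 2:
        return list(frames)
    raise ValueError("stage must be 1 or 2")


# Placeholder frames; in practice these would be image tensors.
frames = [f"frame_{i}" for i in range(8)]
stage1 = frames_for_stage(frames, stage=1, seed=0)  # unordered frames
stage2 = frames_for_stage(frames, stage=2)          # original order restored
```

Using a local `random.Random` instance keeps the shuffle reproducible without affecting other code that relies on the global random state.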