Existing diffusion-based video editing models have made significant advances in editing the attributes of a source video over time, but they struggle to manipulate motion while preserving the original protagonist's appearance and background. To address this, we propose MotionEditor, a diffusion model for video motion editing. MotionEditor incorporates a novel content-aware motion adapter into ControlNet to capture temporal motion correspondence. Although ControlNet enables direct generation from skeleton poses, it has difficulty modifying the source motion in the inverted noise because the noise (source) and the condition (reference) carry contradictory signals. Our adapter complements ControlNet by incorporating source content so that the adapted control signals transfer seamlessly. Further, we build a two-branch architecture (a reconstruction branch and an editing branch) with a high-fidelity attention injection mechanism that facilitates interaction between the branches. This mechanism lets the editing branch query the key and value from the reconstruction branch in a decoupled manner, so that the editing branch retains the original background and protagonist appearance. We also propose a skeleton alignment algorithm to address discrepancies in pose size and position. Experiments demonstrate the promising motion editing ability of MotionEditor, both qualitatively and quantitatively.
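As a rough illustration of the attention injection idea, the sketch below shows queries from an editing branch attending over keys and values taken from a reconstruction branch. It is a minimal, standard-library approximation under stated assumptions, not MotionEditor's implementation: the function name `inject_attention`, the single-head layout, and the token-list representation are all hypothetical, and the decoupling of the injected keys and values described above is not modeled.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def inject_attention(q_edit, k_recon, v_recon):
    """Single-head scaled dot-product attention across two branches.

    Hypothetical helper: queries come from the editing branch, while keys
    and values come from the reconstruction branch, so every output token
    is a weighted mix of reconstruction-branch values.

    q_edit:  list of query vectors from the editing branch
    k_recon: list of key vectors from the reconstruction branch
    v_recon: list of value vectors from the reconstruction branch
    """
    scale = 1.0 / math.sqrt(len(q_edit[0]))
    value_dim = len(v_recon[0])
    outputs = []
    for q in q_edit:
        # Similarity of this editing query to every reconstruction key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in k_recon]
        weights = softmax(scores)
        # Weighted sum of reconstruction values.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, v_recon))
            for j in range(value_dim)
        ])
    return outputs
```

Because the output is built only from reconstruction-branch values, the edited tokens can only draw appearance from the source content, which is one way to read the claim that the editing branch retains the original background and protagonist appearance.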