Despite remarkable advances in diffusion-based video editing, existing methods are limited to short videos because of the conflict between long-range consistency and frame-wise editing. Recent approaches attempt to tackle this challenge by introducing video-2D representations that reduce video editing to image editing. However, they encounter significant difficulties with videos that contain large-scale motion and view changes, especially human-centric videos. This motivates us to introduce dynamic Neural Radiance Fields (NeRF) as the human-centric video representation, which reduces the video editing problem to a 3D space editing task. As such, edits can be performed in 3D space and propagated to the entire video via the deformation field. To provide finer and directly controllable editing, we propose an image-based 3D space editing pipeline with a set of effective designs. These include multi-view multi-pose Score Distillation Sampling (SDS) from both 2D personalized diffusion priors and 3D diffusion priors, reconstruction losses on the reference image, text-guided local parts super-resolution, and style transfer for the 3D background space. Extensive experiments demonstrate that our method, dubbed DynVideo-E, significantly outperforms state-of-the-art approaches on two challenging datasets by a large margin of 50% to 95% in terms of human preference. Compelling video comparisons are provided on the project page https://showlab.github.io/DynVideo-E/. Our code and data will be released to the community.