3D editing has shown remarkable capability in modifying scenes according to various instructions. However, existing methods struggle to achieve intuitive, localized edits, such as selectively making flowers blossom. Drag-style editing has demonstrated exceptional capability for editing images through direct manipulation rather than ambiguous text commands. Nevertheless, extending drag-based editing to 3D scenes presents substantial challenges due to multi-view inconsistency. To this end, we introduce DragScene, a framework that integrates drag-style editing with diverse 3D representations. First, latent optimization is performed on a reference view to generate a 2D edit based on user instructions. Subsequently, coarse 3D clues are reconstructed from the reference view using a point-based representation to capture the geometric details of the edit. The latent representation of the edited view is then mapped onto these 3D clues, which guide the latent optimization of the other views. This process propagates edits seamlessly across views while maintaining multi-view consistency. Finally, the target 3D scene is reconstructed from the edited multi-view images. Extensive experiments demonstrate that DragScene enables precise and flexible drag-style editing of 3D scenes and applies broadly across diverse 3D representations.
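The four-stage pipeline described above can be sketched as a minimal, toy Python skeleton. This is not the paper's implementation: the real system optimizes diffusion latents and builds a point-based 3D representation, whereas every function body here is a placeholder, and all names (`edit_reference_view`, `lift_to_point_clues`, etc.) are illustrative assumptions.

```python
# Hedged sketch of the DragScene pipeline; all bodies are toy stand-ins
# for the actual latent-optimization and point-based reconstruction steps.

def edit_reference_view(view, drag_points):
    # Step 1 (placeholder): "latent optimization" reduced to recording
    # where each dragged source point should land.
    edits = {src: dst for src, dst in drag_points}
    return {"pixels": view["pixels"], "edits": edits}

def lift_to_point_clues(edited_view):
    # Step 2 (placeholder): lift edited 2D points to coarse 3D clues by
    # assigning each a unit depth.
    return [(x, y, 1.0) for (x, y) in edited_view["edits"].values()]

def propagate_to_view(point_clues, camera_shift):
    # Step 3 (placeholder): project the 3D clues into another view as a
    # simple horizontal parallax shift, guiding that view's edit.
    return [(x + camera_shift, y) for (x, y, z) in point_clues]

def reconstruct_scene(views):
    # Step 4 (placeholder): rebuild the target scene from edited views.
    return {"edited_views": views}
```

A usage example with one drag instruction, moving a point from (2, 3) to (5, 3), shows how the edit flows through all four stages:

```python
reference = {"pixels": None}
edited = edit_reference_view(reference, [((2, 3), (5, 3))])
clues = lift_to_point_clues(edited)            # [(5, 3, 1.0)]
other_view = propagate_to_view(clues, 1.0)     # [(6.0, 3)]
scene = reconstruct_scene([edited["edits"], other_view])
```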