We consider the problem of editing 3D objects and scenes based on open-ended language instructions. The established paradigm for solving this problem is to use a 2D image generator or editor to guide the 3D editing process. However, this is often slow, as it requires updating a computationally expensive 3D representation such as a neural radiance field, and doing so using contradictory guidance from a 2D model that is inherently not multi-view consistent. We thus introduce the Direct Gaussian Editor (DGE), a method that addresses these issues in two ways. First, we modify a given high-quality image editor such as InstructPix2Pix to be multi-view consistent. We do so with a training-free approach that integrates cues from the underlying 3D geometry of the scene. Second, given a multi-view consistent sequence of edited images of the object, we directly and efficiently optimize the 3D object representation, which is based on 3D Gaussian Splatting. Because it does not require applying edits incrementally and iteratively, DGE is significantly more efficient than existing approaches and offers additional benefits, such as allowing selective editing of parts of the scene.