Recently, impressive results have been achieved in 3D scene editing with text instructions based on 2D diffusion models. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes delicate, especially localized, editing of 3D scenes challenging. Inspired by the recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit nature of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction and align it to the 3D Gaussians. This Gaussian RoI is then used to control the editing process. Our framework achieves more delicate and precise editing of 3D scenes than previous methods while training much faster, i.e., within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes to 2 hours).
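To make the RoI-to-Gaussian alignment concrete, here is a minimal sketch, not the authors' implementation: it assumes per-Gaussian centers, pinhole cameras, and 2D masks already produced by some text-grounded segmenter, and it marks a Gaussian as in-RoI when its projected center falls inside the mask in enough views. The function names (`project`, `gaussians_in_roi`) and the voting threshold are illustrative assumptions, as is the final step of zeroing edit gradients outside the RoI to confine the edit.

```python
# A minimal sketch (not the authors' code) of lifting a 2D text-grounded mask
# to a per-Gaussian RoI, then gating edit updates so only that RoI changes.
import numpy as np

def project(points_world: np.ndarray, K: np.ndarray, w2c: np.ndarray) -> np.ndarray:
    """Project Nx3 world points to Nx2 pixel coordinates with a pinhole camera."""
    homo = np.concatenate([points_world, np.ones((len(points_world), 1))], axis=1)
    cam = (w2c @ homo.T).T[:, :3]               # world frame -> camera frame
    pix = (K @ cam.T).T                         # camera frame -> image plane
    return pix[:, :2] / np.clip(pix[:, 2:3], 1e-8, None)

def gaussians_in_roi(centers, masks, Ks, w2cs, vote_thresh=0.5):
    """Mark a Gaussian as in-RoI if its center lands inside the 2D mask in at
    least `vote_thresh` of the views where its projection is on-screen."""
    votes = np.zeros(len(centers))
    visible = np.zeros(len(centers))
    for mask, K, w2c in zip(masks, Ks, w2cs):
        uv = np.round(project(centers, K, w2c)).astype(int)
        h, w = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        visible += inside
        hit = np.zeros(len(centers), dtype=bool)
        hit[inside] = mask[uv[inside, 1], uv[inside, 0]] > 0
        votes += hit
    return votes / np.maximum(visible, 1) >= vote_thresh

# Usage with synthetic data: one view, one square mask standing in for the
# text-grounded segmentation; edit gradients outside the RoI are zeroed.
rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(1000, 3))
K = np.array([[500., 0., 128.], [0., 500., 128.], [0., 0., 1.]])
w2c = np.eye(4); w2c[2, 3] = 4.0                # camera looking down +z
mask = np.zeros((256, 256)); mask[96:160, 96:160] = 1.0
roi = gaussians_in_roi(centers, [mask], [K], [w2c])
grad = rng.normal(size=centers.shape)           # stand-in for edit gradients
grad[~roi] = 0.0                                # edits confined to the RoI
print(f"{roi.sum()} / {len(centers)} Gaussians inside the RoI")
```

Because Gaussians are explicit primitives, this per-primitive masking is a direct set membership test, which is what lets the framework restrict an edit to the region named by the instruction rather than re-optimizing the whole scene.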