Recently, impressive results have been achieved in text-instructed 3D scene editing based on 2D diffusion models. However, current diffusion models primarily generate images by predicting noise in the latent space, and the editing is usually applied to the whole image, which makes delicate editing of 3D scenes, especially localized editing, challenging. Inspired by recent 3D Gaussian splatting, we propose a systematic framework, named GaussianEditor, to edit 3D scenes delicately via 3D Gaussians with text instructions. Benefiting from the explicit property of 3D Gaussians, we design a series of techniques to achieve delicate editing. Specifically, we first extract the region of interest (RoI) corresponding to the text instruction and align it to the 3D Gaussians. The resulting Gaussian RoI is then used to control the editing process. Our framework achieves more delicate and precise editing of 3D scenes than previous methods while training much faster, i.e., within 20 minutes on a single V100 GPU, more than twice as fast as Instruct-NeRF2NeRF (45 minutes to 2 hours).
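The idea of aligning a text-derived RoI to the 3D Gaussians and then using it to control editing can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `lift_mask_to_gaussians` and `masked_update` are hypothetical, and the sketch assumes a single view, a pinhole camera with intrinsics `K`, Gaussian centers already expressed in camera coordinates, and a 2D boolean mask that some upstream step has produced from the text instruction. Under those assumptions, a Gaussian joins the RoI when its projected center lands inside the mask, and only RoI Gaussians receive parameter updates.

```python
import numpy as np


def lift_mask_to_gaussians(centers, mask, K, min_depth=1e-6):
    """Mark Gaussians whose projected centers fall inside a 2D mask.

    Assumptions (not from the paper): `centers` is (N, 3) in camera
    coordinates, `mask` is an (H, W) boolean RoI for this view, and
    `K` is a (3, 3) pinhole intrinsics matrix.
    """
    z = centers[:, 2]
    in_front = z > min_depth
    # Avoid dividing by zero or negative depth for points behind the camera.
    safe_z = np.where(in_front, z, 1.0)
    uvw = centers @ K.T
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)
    h, w = mask.shape
    visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    roi = np.zeros(len(centers), dtype=bool)
    roi[visible] = mask[v[visible], u[visible]]
    return roi


def masked_update(params, grads, roi, lr=0.01):
    """Gradient step applied only to Gaussians inside the RoI.

    `params` and `grads` are (N, D); `roi` is the (N,) boolean mask.
    Gaussians outside the RoI are left unchanged.
    """
    return params - lr * grads * roi[:, None]
```

With a mask covering the image center, a Gaussian on the optical axis would be selected and updated, while one projecting outside the mask or lying behind the camera would keep its parameters. A multi-view variant would need some rule for combining per-view decisions, which this sketch leaves out.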