We introduce InseRF, a novel method for generative object insertion in NeRF reconstructions of 3D scenes. Given a user-provided textual description and a 2D bounding box in a reference viewpoint, InseRF generates new objects in 3D scenes. Methods for 3D scene editing have recently been profoundly transformed by the use of strong priors from text-to-image diffusion models in 3D generative modeling. Existing methods are mostly effective at editing 3D scenes through style and appearance changes or by removing existing objects. Generating new objects, however, remains a challenge for such methods, and it is the problem we address in this study. Specifically, we propose grounding the 3D object insertion in a 2D object insertion in a reference view of the scene. The 2D edit is then lifted to 3D using a single-view object reconstruction method. The reconstructed object is then inserted into the scene, guided by the priors of monocular depth estimation methods. We evaluate our method on various 3D scenes and provide an in-depth analysis of the proposed components. Our experiments with generative object insertion in several 3D scenes indicate that our method is effective compared to existing methods. InseRF is capable of controllable and 3D-consistent object insertion without requiring explicit 3D information as input. Please visit our project page at https://mohamad-shahbazi.github.io/inserf.
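To make the depth-guided placement step more concrete, the following is a minimal sketch of how a 2D bounding box and a depth estimate could determine a 3D position and approximate size for an inserted object. It is not the InseRF implementation: it assumes a simple pinhole camera, a depth value measured along the camera z-axis at the box center, and the names `Intrinsics`, `backproject`, and `place_object` are hypothetical, introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Intrinsics:
    """Pinhole camera intrinsics in pixels (assumed model, not from the paper)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u, v, depth, K):
    """Map pixel (u, v) with depth along the camera z-axis to a camera-space point."""
    x = (u - K.cx) * depth / K.fx
    y = (v - K.cy) * depth / K.fy
    return (x, y, depth)


def place_object(bbox, depth_at_center, K):
    """Estimate a 3D center and metric extent from a 2D box and one depth value.

    bbox is (u_min, v_min, u_max, v_max) in pixels. The extent is the size a
    fronto-parallel rectangle at that depth would need to fill the box.
    """
    u_min, v_min, u_max, v_max = bbox
    u_c = 0.5 * (u_min + u_max)
    v_c = 0.5 * (v_min + v_max)
    center = backproject(u_c, v_c, depth_at_center, K)
    width = (u_max - u_min) * depth_at_center / K.fx
    height = (v_max - v_min) * depth_at_center / K.fy
    return center, (width, height)


if __name__ == "__main__":
    K = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
    center, extent = place_object((270, 190, 370, 290), 2.0, K)
    print(center, extent)
```

In this sketch the depth value stands in for the output of a monocular depth estimator evaluated on the edited reference view; transforming the resulting camera-space point into the scene's world frame would additionally require the reference camera pose.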