Combining pretrained diffusion models with neural radiance fields (NeRFs) has emerged as a promising approach for text-to-3D generation. However, simply coupling a NeRF with a diffusion model leads to cross-view inconsistency and degraded stylized view synthesis. To address this challenge, we propose the Edit-DiffNeRF framework, which consists of a frozen diffusion model, a delta module that edits the latent semantic space of the diffusion model, and a NeRF. Instead of training the entire diffusion model for each scene, our method uses the delta module to edit the latent semantic space of the frozen pretrained diffusion model. This fundamental change to the standard diffusion framework enables us to make fine-grained modifications to the rendered views and to consolidate these edits in a 3D scene via NeRF training. As a result, we are able to produce an edited 3D scene that faithfully aligns with the input text instructions. Furthermore, to ensure semantic consistency across viewpoints, we propose a novel multi-view semantic consistency loss that extracts a latent semantic embedding from the input view as a prior and aims to reconstruct it in the other views. Experiments show that our method effectively edits real-world 3D scenes, yielding a 25% improvement in the alignment of the performed 3D edits with text instructions compared to prior work.
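The abstract does not give the exact form of the delta module or of the multi-view semantic consistency loss, so the following is only a minimal sketch under stated assumptions. It assumes the delta module adds a learned linear offset to a frozen latent embedding, and that the consistency loss is the mean squared distance between the input-view embedding and the embeddings of the other views. The names `apply_delta` and `multiview_consistency_loss`, the linear parameterization, and the squared-distance form are all hypothetical and chosen for illustration.

```python
import numpy as np


def apply_delta(h, delta_w, delta_b):
    """Shift frozen latent embeddings by a learned offset (assumed linear form).

    h: array of shape (D,) or (V, D) taken from the frozen diffusion model.
    delta_w, delta_b: hypothetical trainable parameters, shapes (D, D) and (D,).
    """
    return h + h @ delta_w + delta_b


def multiview_consistency_loss(h_ref, h_views):
    """Mean squared L2 distance between the input-view embedding and V other views.

    h_ref: shape (D,), the embedding used as the prior.
    h_views: shape (V, D), embeddings extracted from the other viewpoints.
    """
    diff = h_views - h_ref[None, :]
    return float(np.mean(np.sum(diff ** 2, axis=1)))


# Toy data standing in for latent semantic embeddings (not real model outputs).
rng = np.random.default_rng(0)
D, V = 8, 4
h_ref = rng.normal(size=D)
h_views = h_ref + 0.1 * rng.normal(size=(V, D))

# A zero-initialised delta leaves the frozen embeddings unchanged.
delta_w = np.zeros((D, D))
delta_b = np.zeros(D)

edited_ref = apply_delta(h_ref, delta_w, delta_b)
edited_views = apply_delta(h_views, delta_w, delta_b)
loss = multiview_consistency_loss(edited_ref, edited_views)
```

In this sketch only `delta_w` and `delta_b` would be trained, which mirrors the idea of leaving the diffusion model frozen. The loss is zero exactly when every view reproduces the input-view embedding.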