Large-scale text-guided diffusion models have garnered significant attention due to their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses the power of latent diffusion models for editing existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation to conform to a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that combining this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object is challenging, as it requires achieving two conflicting goals while viewing only 2D projections in which structure and appearance are coupled. Thus, we introduce a novel volumetric regularization loss that operates directly in 3D space, utilizing the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited object. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a wide range of edits that cannot be achieved by prior work.
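The volumetric regularization described in the abstract enforces correlation between the global structure of the original and edited object directly in 3D. As a minimal sketch of what such a term could look like, the snippet below computes one minus the Pearson correlation between two density grids. The abstract does not give the exact formula, so the Pearson form, the function name `volumetric_correlation_loss`, the argument names `density_orig` and `density_edit`, and the `eps` stabilizer are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def volumetric_correlation_loss(density_orig, density_edit, eps=1e-8):
    """Hypothetical 3D regularizer: 1 - Pearson correlation of two density grids.

    Assumption: both inputs are arrays of identical shape (e.g. D x H x W)
    holding per-voxel density values of the original and the edited object.
    The loss is near 0 when the grids are perfectly correlated and near 2
    when they are perfectly anti-correlated.
    """
    a = np.asarray(density_orig, dtype=np.float64).ravel()
    b = np.asarray(density_edit, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("density grids must have the same number of voxels")

    # Center both grids so that only structural co-variation is measured.
    a_centered = a - a.mean()
    b_centered = b - b.mean()

    covariance = np.mean(a_centered * b_centered)
    # eps guards against division by zero for constant grids (an assumption).
    scale = np.sqrt(np.mean(a_centered ** 2) * np.mean(b_centered ** 2)) + eps
    return 1.0 - covariance / scale


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    original = rng.random((8, 8, 8))
    # A rescaled and shifted copy keeps the same global structure.
    print(volumetric_correlation_loss(original, 2.0 * original + 1.0))
    # An unrelated random grid is only weakly correlated with the original.
    print(volumetric_correlation_loss(original, rng.random((8, 8, 8))))
```

Because correlation is invariant to affine changes of the density values, a term of this kind penalizes changes in global structure while leaving appearance free to follow the text prompt. In an actual optimization loop it would be written in a differentiable framework and added, with some weight, to the SDS loss.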