Large-scale text-guided diffusion models have garnered significant attention for their ability to synthesize diverse images that convey complex visual concepts. This generative power has more recently been leveraged to perform text-to-3D synthesis. In this work, we present a technique that harnesses latent diffusion models to edit existing 3D objects. Our method takes oriented 2D images of a 3D object as input and learns a grid-based volumetric representation of it. To guide the volumetric representation toward a target text prompt, we follow unconditional text-to-3D methods and optimize a Score Distillation Sampling (SDS) loss. However, we observe that it is challenging to combine this diffusion-guided loss with an image-based regularization loss that encourages the representation not to deviate too strongly from the input object, because doing so requires achieving two conflicting goals while viewing only 2D projections in which structure and appearance are coupled. We therefore introduce a novel volumetric regularization loss that operates directly in 3D space, using the explicit nature of our 3D representation to enforce correlation between the global structure of the original and edited objects. Furthermore, we present a technique that optimizes cross-attention volumetric grids to refine the spatial extent of the edits. Extensive experiments and comparisons demonstrate the effectiveness of our approach in creating a wide range of edits that prior works cannot achieve.
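To illustrate the idea of a regularizer that acts on the 3D representation rather than on rendered images, the following is a minimal sketch only. It assumes that "correlation between the global structure" can be measured as the Pearson correlation between the density values of the original and edited voxel grids, and that the loss is one minus that correlation. The function names `flatten` and `correlation_reg_loss`, the nested-list grid layout, and the `eps` stabilizer are hypothetical choices made for this illustration; the abstract does not specify the exact form of the loss.

```python
import math


def flatten(grid):
    """Flatten a nested list (e.g. a D x H x W density grid) into a flat list of floats."""
    if isinstance(grid, (int, float)):
        return [float(grid)]
    values = []
    for item in grid:
        values.extend(flatten(item))
    return values


def correlation_reg_loss(original_density, edited_density, eps=1e-8):
    """One minus the Pearson correlation of two density grids (assumed form).

    The loss is close to 0 when the edited grid is a positive affine
    transform of the original and grows toward 2 as the two grids become
    anti-correlated. `eps` is a hypothetical stabilizer for constant grids.
    """
    a = flatten(original_density)
    b = flatten(edited_density)
    if len(a) != len(b):
        raise ValueError("density grids must contain the same number of voxels")
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    return 1.0 - cov / math.sqrt(var_a * var_b + eps)


# Toy 2 x 2 x 2 density grid standing in for a learned volume.
original = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [1.0, 1.0]]]
# Same occupancy pattern with doubled density: the structure is preserved.
scaled = [[[2.0 * v for v in row] for row in plane] for plane in original]
# Occupied and empty voxels swapped: the structure is destroyed.
inverted = [[[1.0 - v for v in row] for row in plane] for plane in original]

loss_same_structure = correlation_reg_loss(original, scaled)
loss_inverted_structure = correlation_reg_loss(original, inverted)
```

Because correlation is invariant to positive scaling and shifting of the density values, a term of this kind penalizes changes to where matter is located rather than changes to its magnitude or color, which is one way a 3D-space regularizer could leave appearance free to follow the text prompt. In an actual optimization loop this quantity would be computed on differentiable tensors and added to the SDS loss with a weighting factor.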