Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that heavily depends on these models to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits are subsequently used to guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than various existing approaches, including 2D-based 3D editors and training-based methods.
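To make the abstract-then-edit idea concrete, the following is a minimal sketch of what a primitive-level edit could look like. It is not the Prox-E implementation: the `Cuboid` type, the `apply_edit` and `changed_names` helpers, the edit dictionary format, and the chair example are all hypothetical names and values chosen for illustration. It assumes the abstraction is a list of axis-aligned cuboids and that the VLM's output has already been parsed into a per-primitive scale instruction.

```python
from dataclasses import dataclass, replace
from typing import Dict, List, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class Cuboid:
    """Hypothetical primitive: an axis-aligned box with a semantic label."""
    name: str
    center: Vec3
    size: Vec3


def apply_edit(primitives: List[Cuboid], edit: Dict) -> List[Cuboid]:
    """Scale the size of the primitive named in `edit`; leave the rest untouched.

    `edit` is an assumed format: {"name": str, "scale": (sx, sy, sz)}.
    Scaling is about the primitive's center, which is a simplification.
    """
    sx, sy, sz = edit["scale"]
    edited = []
    for p in primitives:
        if p.name == edit["name"]:
            p = replace(p, size=(p.size[0] * sx, p.size[1] * sy, p.size[2] * sz))
        edited.append(p)
    return edited


def changed_names(before: List[Cuboid], after: List[Cuboid]) -> List[str]:
    """Names of primitives that differ, i.e. the regions a generator would need to redo."""
    return [a.name for b, a in zip(before, after) if a != b]


# Illustrative abstraction of a chair (values are made up).
chair = [
    Cuboid("seat", (0.0, 0.0, 0.5), (0.5, 0.5, 0.125)),
    Cuboid("back", (0.0, -0.25, 0.875), (0.5, 0.125, 0.5)),
    Cuboid("legs", (0.0, 0.0, 0.25), (0.5, 0.5, 0.5)),
]

# Assumed parsed VLM output for an instruction such as "make the backrest taller".
edit = {"name": "back", "scale": (1.0, 1.0, 1.5)}

edited_chair = apply_edit(chair, edit)
to_regenerate = changed_names(chair, edited_chair)
```

In this sketch, only the primitives listed in `to_regenerate` would be handed to a downstream generative model as regions to modify, while every other primitive marks geometry to keep as-is. This mirrors the stated goal of localized edits with identity preservation.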