Text-based 2D image editing models have recently reached an impressive level of maturity, motivating a growing body of work that relies heavily on them to drive 3D edits. While effective for appearance-based modifications, such 2D-centric 3D editing pipelines often struggle with fine-grained 3D editing, where localized structural changes must be applied while strictly preserving an object's overall identity. To address this limitation, we propose Prox-E, a training-free framework that enables fine-grained 3D control through an explicit, primitive-based geometric abstraction. Our framework first abstracts an input 3D shape into a compact set of geometric primitives. A pretrained vision-language model (VLM) then edits this abstraction to specify primitive-level changes. These structural edits then guide a 3D generative model, enabling fine-grained, localized modifications while preserving unchanged regions of the original shape. Through extensive experiments, we demonstrate that our method consistently balances identity preservation, shape quality, and instruction fidelity more effectively than existing approaches, including 2D-based 3D editors and training-based methods.