Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
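The supervised latent-to-latent formulation described above can be illustrated with a minimal toy sketch: a frozen encoder maps the source shape into a latent, a conditional denoiser predicts the noise added to the target (edited) latent given the source latent and an edited-image embedding, and training minimizes a standard diffusion MSE loss over (source shape, edited image, edited shape) triplets. All dimensions, the toy linear denoiser standing in for the 3D DiT, and the noise schedule here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # latent dimension (illustrative assumption)

def encode(x):
    """Stand-in for a frozen 3D VAE encoder; identity for this sketch."""
    return x

# One training triplet: source shape latent, edited-image embedding,
# and the corresponding edited (target) shape latent.
z_src = rng.normal(size=D)
img_cond = rng.normal(size=D)
z_tgt = rng.normal(size=D)

# Toy denoiser: a single linear map over the concatenation of
# [noisy target latent, source latent, image condition] -- a stand-in
# for the 3D Diffusion Transformer (DiT).
W = rng.normal(scale=0.01, size=(D, 3 * D))

def denoiser(z_noisy, z_source, cond):
    return W @ np.concatenate([z_noisy, z_source, cond])

# Forward diffusion: corrupt the target latent at a fixed toy timestep.
t = 0.5
eps = rng.normal(size=D)
z_noisy = np.sqrt(1.0 - t) * z_tgt + np.sqrt(t) * eps

# Supervised objective: predict the injected noise, conditioned on the
# source latent and the edited-image embedding (image-as-prompt).
eps_pred = denoiser(z_noisy, encode(z_src), img_cond)
loss = float(np.mean((eps_pred - eps) ** 2))
```

In the actual framework, `denoiser` would be a pretrained 3D DiT adapted through supervised training on such triplets, so that editing reduces to a single conditional denoising pass in the native 3D latent space rather than per-asset optimization or multi-view 2D propagation.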