Recent advancements in 3D foundation models have enabled the generation of high-fidelity assets, yet precise 3D manipulation remains a significant challenge. Existing 3D editing frameworks often face a difficult trade-off between visual controllability, geometric consistency, and scalability. Specifically, optimization-based methods are prohibitively slow, multi-view 2D propagation techniques suffer from visual drift, and training-free latent manipulation methods are inherently bound by frozen priors and cannot directly benefit from scaling. In this work, we present ShapeUP, a scalable, image-conditioned 3D editing framework that formulates editing as a supervised latent-to-latent translation within a native 3D representation. This formulation allows ShapeUP to build on a pretrained 3D foundation model, leveraging its strong generative prior while adapting it to editing through supervised training. In practice, ShapeUP is trained on triplets consisting of a source 3D shape, an edited 2D image, and the corresponding edited 3D shape, and learns a direct mapping using a 3D Diffusion Transformer (DiT). This image-as-prompt approach enables fine-grained visual control over both local and global edits and achieves implicit, mask-free localization, while maintaining strict structural consistency with the original asset. Our extensive evaluations demonstrate that ShapeUP consistently outperforms current trained and training-free baselines in both identity preservation and edit fidelity, offering a robust and scalable paradigm for native 3D content creation.
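To make the triplet formulation concrete, the following is a minimal sketch of how one supervised training example might be consumed. It is not the ShapeUP implementation. The abstract specifies only the triplet (source 3D shape, edited 2D image, edited 3D shape) and the use of a 3D DiT. Everything else here is an assumption made for illustration: the names `EditTriplet`, `interpolate`, `training_loss`, and `copy_source_model`, the flat-list latents, the linear noising schedule, and the mean-squared-error objective.

```python
import random
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


@dataclass
class EditTriplet:
    """One supervised example (hypothetical container, not from the paper)."""
    source_latent: Vector          # latent code of the source 3D shape
    edit_image_embedding: Vector   # embedding of the edited 2D image (the prompt)
    target_latent: Vector          # latent code of the edited 3D shape


def interpolate(noise: Vector, target: Vector, t: float) -> Vector:
    """Linear mix (assumed schedule): t=0 is pure noise, t=1 is the clean target."""
    return [(1.0 - t) * n + t * x for n, x in zip(noise, target)]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(
    model: Callable[[Vector, Vector, Vector, float], Vector],
    triplet: EditTriplet,
    t: float,
    rng: random.Random,
) -> float:
    """Noise the target latent, let the model predict the clean target
    conditioned on the source latent and the edit image, and score it."""
    noise = [rng.gauss(0.0, 1.0) for _ in triplet.target_latent]
    noisy_target = interpolate(noise, triplet.target_latent, t)
    prediction = model(
        noisy_target, triplet.source_latent, triplet.edit_image_embedding, t
    )
    return mse(prediction, triplet.target_latent)


def copy_source_model(
    noisy_target: Vector, source_latent: Vector, image_embedding: Vector, t: float
) -> Vector:
    """Placeholder for the 3D DiT: ignores the edit and returns the source latent."""
    return list(source_latent)


triplet = EditTriplet(
    source_latent=[0.0, 1.0, 2.0],
    edit_image_embedding=[0.5, 0.5],
    target_latent=[0.0, 1.0, 3.0],  # only the last coordinate is edited
)
loss = training_loss(copy_source_model, triplet, t=0.5, rng=random.Random(0))
```

With the placeholder model, the loss is nonzero only on the edited coordinate, which loosely illustrates why a supervised objective of this kind could reward both preserving the unedited parts of the source and applying the change shown in the image. A real system would replace `copy_source_model` with a trained network and the list-based latents with the foundation model's native 3D latent representation.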