We propose a novel feed-forward 3D editing framework called Shap-Editor. Prior research on 3D editing primarily concentrated on editing individual objects by leveraging off-the-shelf 2D image editing networks. This is achieved via a process called distillation, which transfers knowledge from the 2D network to 3D assets. Distillation requires at least tens of minutes per asset to attain satisfactory editing results, and is thus not very practical. In contrast, we ask whether 3D editing can be carried out directly by a feed-forward network, eschewing test-time optimisation. In particular, we hypothesise that editing can be greatly simplified by first encoding 3D objects in a suitable latent space. We validate this hypothesis by building upon the latent space of Shap-E. We demonstrate that direct 3D editing in this space is possible and efficient by building a feed-forward editor network that requires only approximately one second per edit. Our experiments show that Shap-Editor generalises well to both in-distribution and out-of-distribution 3D assets with different prompts, performing comparably to methods that carry out test-time optimisation for each edited instance.