3D content manipulation is an important computer vision task with many real-world applications (e.g., product design, cartoon generation, and 3D avatar editing). Recently proposed 3D GANs can generate diverse, photorealistic 3D-aware content using Neural Radiance Fields (NeRF). However, manipulating NeRF remains challenging, since visual quality tends to degrade after manipulation and existing methods rely on suboptimal control handles such as 2D semantic maps. While text-guided manipulation has shown potential in 3D editing, such approaches often lack locality. To overcome these problems, we propose Local Editing NeRF (LENeRF), which requires only text inputs for fine-grained, localized manipulation. Specifically, we present three add-on modules of LENeRF: the Latent Residual Mapper, the Attention Field Network, and the Deformation Network, which are jointly used to manipulate 3D features locally by estimating a 3D attention field. The 3D attention field is learned in an unsupervised way by distilling the zero-shot mask generation capability of CLIP into 3D space with multi-view guidance. We conduct diverse experiments and thorough quantitative and qualitative evaluations.
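One way to picture the role of the 3D attention field is as a soft per-point mask that decides where edited features replace the original ones. The NumPy sketch below is illustrative only: the linear blending rule, the function name `blend_features`, and the toy shapes are assumptions made for this example, not details stated in the abstract.

```python
import numpy as np


def blend_features(source_feat, target_feat, attention):
    """Blend per-point features using a soft mask.

    source_feat, target_feat: arrays of shape (num_points, feat_dim).
    attention: array of shape (num_points,), with 0 meaning "keep the source"
    and 1 meaning "use the target". Linear blending is an assumption here.
    """
    # Clip to [0, 1] and add a trailing axis so the mask broadcasts over features.
    mask = np.clip(attention, 0.0, 1.0)[..., None]
    return (1.0 - mask) * source_feat + mask * target_feat


# Toy data: 4 sampled 3D points, each with a 3-dimensional feature (hypothetical).
rng = np.random.default_rng(0)
source_feat = rng.normal(size=(4, 3))
target_feat = rng.normal(size=(4, 3))

# Points 0 and 1 lie outside the edited region, point 2 inside, point 3 on the border.
attention = np.array([0.0, 0.0, 1.0, 0.5])

edited = blend_features(source_feat, target_feat, attention)
```

In this toy setup, points with zero attention keep their source features exactly, which is the sense in which a mask of this kind keeps an edit local.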