Recent advances in Neural Radiance Fields (NeRF) have enabled high-fidelity 3D face reconstruction and novel view synthesis, and manipulating such reconstructions has become an essential task in 3D vision. However, existing manipulation methods require extensive human labor, such as a user-provided semantic mask and a manual attribute search, which makes them unsuitable for non-expert users. Instead, our approach requires only a single text prompt to manipulate a face reconstructed with NeRF. To do so, we first train a scene manipulator, a latent code-conditional deformable NeRF, over a dynamic scene so that the latent code controls the face deformation. However, representing a scene deformation with a single latent code is ill-suited to compositing local deformations observed in different instances. We therefore propose the Position-conditional Anchor Compositor (PAC), which learns to represent a manipulated scene with spatially varying latent codes. Renderings of these codes through the scene manipulator are then optimized to yield high cosine similarity to a target text in CLIP embedding space, enabling text-driven manipulation. To the best of our knowledge, our approach is the first to address the text-driven manipulation of a face reconstructed with NeRF. Extensive results, comparisons, and ablation studies demonstrate the effectiveness of our approach.
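The two ideas named above, spatially varying latent codes and a cosine-similarity objective in an embedding space, can be illustrated with a minimal NumPy sketch. The abstract does not specify how PAC is built, so everything concrete here is an assumption made for illustration: the softmax blend over a small set of anchor codes, the linear position-to-weight map (`W`, `b`), and the function names `compose_latents` and `clip_style_loss`. The embeddings are plain vectors standing in for CLIP outputs; no CLIP model or NeRF renderer is involved.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compose_latents(points, anchors, W, b):
    """Blend anchor latent codes into one latent code per 3D point.

    ASSUMPTION: a linear map followed by softmax stands in for the
    position-conditional network; the paper's architecture may differ.

    points:  (N, 3) sample positions
    anchors: (K, D) anchor latent codes
    W, b:    (3, K) and (K,) hypothetical weights of the position map
    returns: (N, D) spatially varying latent codes
    """
    weights = softmax(points @ W + b, axis=-1)  # (N, K), rows sum to 1
    return weights @ anchors                    # convex blend of anchors


def cosine_similarity(a, b, eps=1e-8):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def clip_style_loss(image_emb, text_emb):
    # Minimizing this loss maximizes cosine similarity between the
    # (stand-in) image embedding and the target text embedding.
    return 1.0 - cosine_similarity(image_emb, text_emb)


rng = np.random.default_rng(0)
points = rng.normal(size=(5, 3))    # 5 sample positions
anchors = rng.normal(size=(4, 8))   # 4 anchors, 8-dim latent codes
W = rng.normal(size=(3, 4))
b = np.zeros(4)

codes = compose_latents(points, anchors, W, b)
print(codes.shape)  # one 8-dim latent code per point
```

Because each point's code is a convex combination of the anchors, different regions of the face can follow different anchors, which is the property a single global latent code lacks. In a full pipeline the blended codes would condition the deformable NeRF, and the loss would be evaluated on embeddings of the rendered images.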