With the success of Neural Radiance Fields (NeRF) in 3D-aware portrait editing, a variety of works have achieved promising results in both quality and 3D consistency. However, these methods rely heavily on per-prompt optimization when handling natural language as editing instructions. Due to the lack of labeled 3D human face datasets and effective architectures, end-to-end human-instructed 3D-aware editing for open-world portraits remains under-explored. To address this gap, we propose an end-to-end diffusion-based framework termed InstructPix2NeRF, which enables instructed 3D-aware portrait editing from a single open-world image with human instructions. At its core lies a conditional latent 3D diffusion process that lifts 2D editing to 3D space by learning the correlation between the paired images' difference and the instructions via triplet data. With the help of our proposed token position randomization strategy, we can even achieve multi-semantic editing in a single pass while keeping the portrait identity well preserved. We further propose an identity consistency module that directly modulates the extracted identity signals into our diffusion process, which increases multi-view 3D identity consistency. Extensive experiments verify the effectiveness of our method and show its superiority over strong baselines, both quantitatively and qualitatively. Source code and pre-trained models can be found on our project page: \url{https://mybabyyh.github.io/InstructPix2NeRF}.
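The abstract names a token position randomization strategy but does not describe its mechanics. As a rough illustration only, the sketch below assumes one plausible reading: during training, the instruction's tokens are placed as a contiguous run at a random offset inside a fixed-length padded sequence, so the model does not tie an instruction to one fixed position. The function name `randomize_token_position`, the constant `PAD_ID`, and the sequence length are all hypothetical and are not taken from the authors' code.

```python
import random

PAD_ID = 0  # hypothetical padding token id, chosen for illustration


def randomize_token_position(token_ids, max_len, rng=random):
    """Place token_ids as one contiguous run at a random offset in a padded sequence.

    This is an assumed interpretation of "token position randomization",
    not the published implementation.
    """
    if len(token_ids) > max_len:
        raise ValueError("instruction is longer than max_len")
    # randint is inclusive on both ends, so the run always fits.
    offset = rng.randint(0, max_len - len(token_ids))
    padded = [PAD_ID] * max_len
    padded[offset:offset + len(token_ids)] = token_ids
    return padded, offset


# Usage: three instruction tokens placed somewhere inside a length-8 sequence.
rng = random.Random(0)
padded, offset = randomize_token_position([11, 12, 13], max_len=8, rng=rng)
```

Under this assumption, several instructions could occupy different slots of the same sequence at inference time, which is one way a single pass might carry multiple edits; the paper itself should be consulted for the actual design.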