This paper proposes a novel, physically interpretable method for face editing driven by arbitrary text prompts. Unlike previous GAN-inversion-based face editing methods, which manipulate the latent space of a GAN, and diffusion-based methods, which model image manipulation as a reverse diffusion process, we regard face editing as imposing vector flow fields on face images, where each flow vector represents the offset of spatial coordinates and color for one image pixel. Under this paradigm, we represent the vector flow field in two ways: 1) explicitly, as rasterized tensors of flow vectors, and 2) implicitly, as continuous, smooth, and resolution-agnostic neural fields that parameterize the flow vectors, leveraging recent advances in implicit neural representations. The flow vectors are iteratively optimized under the guidance of the pre-trained Contrastive Language-Image Pre-training (CLIP) model by maximizing the correlation between the edited image and the text prompt. We also propose a learning-based one-shot face editing framework that is fast and adapts to any text prompt. Our method can also be flexibly extended to real-time video face editing. Compared with state-of-the-art text-driven face editing methods, our method generates physically interpretable face editing results with high identity consistency and image quality. Our code will be made publicly available.
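To make the flow-field formulation concrete, the following is a minimal NumPy sketch of how a per-pixel spatial and color flow could be applied to an image, in both the explicit (rasterized tensor) and implicit (coordinate network) forms. It is an illustration under stated assumptions, not the authors' implementation: the function names `bilinear_sample`, `apply_flow`, and `mlp_flow` are hypothetical, the backward-warping convention and bilinear interpolation are assumed, and the tiny two-layer network merely stands in for whatever neural field the paper uses. The CLIP-guided optimization of the flow is omitted, since it would require an automatic-differentiation framework and the pre-trained model.

```python
import numpy as np


def bilinear_sample(image, xs, ys):
    """Sample image (H, W, C) at float pixel coordinates xs, ys with bilinear weights."""
    h, w, _ = image.shape
    xs = np.clip(xs, 0.0, w - 1.0)  # clamp samples to the image border
    ys = np.clip(ys, 0.0, h - 1.0)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (xs - x0)[..., None]
    wy = (ys - y0)[..., None]
    top = (1 - wx) * image[y0, x0] + wx * image[y0, x1]
    bottom = (1 - wx) * image[y1, x0] + wx * image[y1, x1]
    return (1 - wy) * top + wy * bottom


def apply_flow(image, spatial_flow, color_flow):
    """Apply an explicit flow field to an image with values in [0, 1].

    spatial_flow: (H, W, 2) per-pixel (dx, dy) offsets in pixels.
    color_flow:   (H, W, 3) per-pixel RGB offsets.
    Assumed convention (backward warping): output pixel (x, y) is read
    from the input at (x + dx, y + dy), then the color offset is added.
    """
    h, w, _ = image.shape
    ys, xs = np.meshgrid(
        np.arange(h, dtype=float), np.arange(w, dtype=float), indexing="ij"
    )
    warped = bilinear_sample(
        image, xs + spatial_flow[..., 0], ys + spatial_flow[..., 1]
    )
    return np.clip(warped + color_flow, 0.0, 1.0)


def mlp_flow(h, w, params):
    """Implicit variant: evaluate a tiny coordinate MLP on an h x w grid.

    Returns an (h, w, 5) array: 2 spatial-offset channels + 3 color-offset
    channels. The same parameters can be queried at any resolution.
    """
    ys, xs = np.meshgrid(
        np.linspace(-1.0, 1.0, h), np.linspace(-1.0, 1.0, w), indexing="ij"
    )
    coords = np.stack([xs, ys], axis=-1)  # normalized (x, y) coordinates
    w1, b1, w2, b2 = params
    hidden = np.tanh(coords @ w1 + b1)
    return hidden @ w2 + b2


# Usage: explicit flow that shifts sampling one pixel to the right.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
shift = np.zeros((8, 8, 2))
shift[..., 0] = 1.0
edited = apply_flow(image, shift, np.zeros((8, 8, 3)))

# Usage: implicit flow from randomly initialized (untrained) parameters.
params = (
    0.1 * rng.standard_normal((2, 16)), np.zeros(16),
    0.1 * rng.standard_normal((16, 5)), np.zeros(5),
)
field = mlp_flow(8, 8, params)
edited_implicit = apply_flow(image, field[..., :2], field[..., 2:])
```

In an actual optimization loop, the entries of the explicit flow tensors (or the parameters of the coordinate network) would be the variables updated to increase the CLIP score between the edited image and the prompt. Because each edit is a displacement or color shift of existing pixels, the learned field can be inspected directly.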