Researchers have recently begun exploring StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing language-guided editing approaches either resort to instance-level latent code optimization or map predefined text prompts to editing directions in the latent space. Both have inherent limitations: the former is inefficient, since it optimizes a latent code separately for each image, while the latter often struggles to handle multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that performs multi-attribute changes efficiently and reliably. The core of our method is a set of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We show that conditioning the initial inversion step on the CLIP embedding of the target description yields more successful edit directions. Additionally, we use a CLIP-guided refinement step to correct the resulting residual latent codes, which further improves alignment with the text prompt. Qualitative and quantitative results show that our method outperforms competing approaches in manipulation accuracy and photo-realism across several domains, including human faces, cats, and birds.
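The abstract does not specify the internal form of the text-conditioned adapter layers, so the following is only a toy sketch of the general idea, under stated assumptions. It assumes a FiLM-style modulation, in which a text embedding produces a per-channel scale and shift for intermediate encoder features, and it assumes the edited latent code is the inverted code plus a predicted residual. The names `text_conditioned_adapter`, `apply_residual`, `scale_w`, and `shift_w` are hypothetical and are not taken from the CLIPInverter implementation.

```python
def linear(weights, vector):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def text_conditioned_adapter(features, text_emb, scale_w, shift_w):
    """Assumed FiLM-style adapter: modulate features with a text embedding.

    gamma and beta are linear functions of the text embedding. With all-zero
    weights the adapter is the identity, so the pretrained encoder's
    behaviour is left unchanged.
    """
    gamma = linear(scale_w, text_emb)
    beta = linear(shift_w, text_emb)
    return [f * (1.0 + g) + b for f, g, b in zip(features, gamma, beta)]


def apply_residual(w_inverted, residual):
    """Assumed edit rule: edited latent = inverted latent + residual."""
    return [w + r for w, r in zip(w_inverted, residual)]


# Toy usage with 2-dimensional features and a 2-dimensional text embedding.
features = [1.0, 2.0]
text_emb = [1.0, 0.0]
scale_w = [[0.5, 0.0], [0.0, 0.0]]
shift_w = [[0.0, 0.0], [1.0, 0.0]]

modulated = text_conditioned_adapter(features, text_emb, scale_w, shift_w)
edited = apply_residual([0.0, 1.0], [0.25, -0.5])
```

In this sketch a CLIP-guided refinement stage would further adjust the residual passed to `apply_residual`; that stage is omitted here because the abstract gives no detail about how it is computed.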