Researchers have recently begun exploring StyleGAN-based models for real image editing. One particularly interesting application is using natural language descriptions to guide the editing process. Existing language-driven editing approaches either resort to instance-level latent code optimization or map predefined text prompts to editing directions in the latent space. Both have inherent limitations: the former is inefficient, while the latter often struggles with multi-attribute changes. To address these weaknesses, we present CLIPInverter, a new text-driven image editing approach that performs multi-attribute changes efficiently and reliably. The core of our method is a set of novel, lightweight text-conditioned adapter layers integrated into pretrained GAN-inversion networks. We demonstrate that conditioning the initial inversion step on the CLIP embedding of the target description yields more successful edit directions. Additionally, a CLIP-guided refinement step corrects the resulting residual latent codes, which further improves alignment with the text prompt. Our qualitative and quantitative results show that our method outperforms competing approaches in manipulation accuracy and photorealism across various domains, including human faces, cats, and birds.