The recently introduced Contrastive Language-Image Pre-Training (CLIP) model bridges images and text by embedding them into a joint latent space. This has opened the door to a large body of work that manipulates an input image according to a textual description. However, because of the discrepancy between image and text embeddings in the joint space, using text embeddings as the optimization target often introduces undesired artifacts in the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee during manipulation. To alleviate these problems, we propose defining corpus subspaces, spanned by relevant prompts, that capture specific image characteristics. We introduce the CLIP Projection-Augmentation Embedding (PAE) as an optimization target that improves the performance of text-guided image manipulation. Our method is a simple and general paradigm that is easy to compute and adapt and can be incorporated smoothly into any CLIP-based image manipulation algorithm. To demonstrate its effectiveness, we conduct several theoretical and empirical studies. As a case study, we apply the method to text-guided semantic face editing. We demonstrate quantitatively and qualitatively that PAE facilitates more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy.
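The idea of a corpus subspace and a projection-augmented target can be illustrated with a small sketch. The abstract does not give the exact PAE formula, so everything below is an assumption made for illustration: the function names `corpus_basis`, `project`, and `augment` are hypothetical, the subspace is taken from an SVD of the prompt embeddings, the augmentation is a simple additive shift scaled by a hypothetical strength `alpha`, and random vectors stand in for real CLIP embeddings.

```python
import numpy as np


def corpus_basis(prompt_embs: np.ndarray, rank: int) -> np.ndarray:
    """Return `rank` orthonormal rows spanning the dominant directions
    of the prompt embeddings (one embedding per row)."""
    _, _, vt = np.linalg.svd(prompt_embs, full_matrices=False)
    return vt[:rank]


def project(x: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project vector `x` onto the row space of `basis`."""
    return basis.T @ (basis @ x)


def augment(image_emb, text_emb, basis, alpha, normalize=True):
    """Assumed form of a projection-augmented target: keep the image
    embedding and add only the in-subspace part of the text embedding."""
    target = image_emb + alpha * project(text_emb, basis)
    if normalize:
        target = target / np.linalg.norm(target)
    return target


# Random stand-ins for CLIP embeddings (assumption: real ones would come
# from a CLIP text/image encoder).
rng = np.random.default_rng(0)
dim = 16
prompt_embs = rng.normal(size=(6, dim))   # embeddings of "relevant prompts"
image_emb = rng.normal(size=dim)
text_emb = rng.normal(size=dim)

basis = corpus_basis(prompt_embs, rank=3)
target = augment(image_emb, text_emb, basis, alpha=2.0)

# The component of the image embedding outside the subspace is untouched
# by the augmentation (checked here before normalization).
raw = augment(image_emb, text_emb, basis, alpha=2.0, normalize=False)
residual_before = image_emb - project(image_emb, basis)
residual_after = raw - project(raw, basis)
print(np.allclose(residual_before, residual_after))
```

In this sketch only the part of the text embedding that lies in the corpus subspace moves the target, which is one plausible way to read the claim that attributes outside the subspace stay undisturbed. The paper's actual construction may differ.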