The recently introduced Contrastive Language-Image Pre-Training (CLIP) bridges images and text by embedding them into a joint latent space. This has opened the door to a growing body of work that manipulates an input image according to a textual description. However, due to the discrepancy between image and text embeddings in the joint space, using text embeddings directly as the optimization target often introduces undesired artifacts into the resulting images. Disentanglement, interpretability, and controllability are also hard to guarantee during manipulation. To alleviate these problems, we propose defining corpus subspaces spanned by relevant prompts to capture specific image characteristics, and we introduce CLIP Projection-Augmentation Embedding (PAE) as an optimization target to improve the performance of text-guided image manipulation. Our method is a simple and general paradigm: it is easy to compute and adapt, and it can be smoothly incorporated into any CLIP-based image manipulation algorithm. To demonstrate its effectiveness, we conduct several theoretical and empirical studies and, as a case study, apply the method to text-guided semantic face editing. We show quantitatively and qualitatively that PAE enables more disentangled, interpretable, and controllable image manipulation with state-of-the-art quality and accuracy. Project page: https://chenliang-zhou.github.io/CLIP-PAE/.
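The projection idea behind PAE can be sketched as follows. This is a minimal illustration, not the paper's implementation: toy random vectors stand in for real CLIP embeddings (which are 512-dimensional), and the names `project`, `corpus`, and `alpha` are assumptions introduced here for clarity.

```python
import numpy as np

# Toy stand-ins for CLIP text embeddings of several related prompts
# (e.g. prompts describing one attribute such as "smile").
rng = np.random.default_rng(0)
corpus = rng.normal(size=(5, 16))   # 5 prompts, 16-d toy embeddings

# Orthonormal basis of the corpus subspace via QR decomposition;
# the columns of `basis` span the subspace.
basis, _ = np.linalg.qr(corpus.T)

def project(v, basis):
    """Orthogonal projection of v onto the span of the basis columns."""
    return basis @ (basis.T @ v)

# Toy stand-ins for the target text embedding and the image embedding.
text_emb = rng.normal(size=16)
image_emb = rng.normal(size=16)

# Projection-augmentation: keep only the component of the text
# embedding that lies in the attribute subspace, scale it by a
# strength coefficient, and add it to the image embedding to form
# the optimization target.
alpha = 1.0
pae_target = image_emb + alpha * project(text_emb, basis)
```

Because the optimization target only carries the component of the text embedding that lies in the attribute subspace, directions unrelated to the chosen attribute are filtered out, which is what the abstract's disentanglement and controllability claims rest on; `alpha` would control edit strength.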