Contrastive Language-Image Pre-Training (CLIP) has advanced the state of the art for a broad range of vision-language cross-modal tasks. In particular, it has opened an intriguing research line of text-guided image style transfer, which dispenses with the style reference images required by traditional style transfer methods. However, directly using CLIP to guide style transfer leads to undesirable artifacts (mainly written words and unrelated visual entities) spread over the image, partly due to the entanglement of visual and written concepts inherent in CLIP. Inspired by the use of spectral analysis to filter linguistic information at different levels of granularity, we analyse the patch embeddings from the last layer of the CLIP vision encoder in the frequency domain and find that the presence of undesirable artifacts is highly correlated with certain frequency components. To alleviate the artifact issue, we propose SpectralCLIP, which implements a spectral filtering layer on top of the CLIP vision encoder. Experimental results show, both quantitatively and qualitatively, that SpectralCLIP effectively prevents the generation of artifacts without impairing the stylisation quality. We further apply SpectralCLIP to text-conditioned image generation and show that it prevents written words from appearing in the generated images. Code is available at https://github.com/zipengxuc/SpectralCLIP.
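The idea of a spectral filtering layer applied to patch embeddings can be illustrated with a minimal sketch. This is not the released SpectralCLIP implementation: it assumes a plain discrete Fourier transform along the token sequence, uses random data in place of real CLIP patch embeddings, and the function name `spectral_band_stop` and the band `(3, 8)` are hypothetical choices made only for illustration.

```python
import numpy as np


def spectral_band_stop(tokens, stop_bands):
    """Zero selected frequency bins along the token axis (illustrative only).

    tokens:     (num_tokens, dim) array standing in for last-layer patch embeddings.
    stop_bands: iterable of (lo, hi) half-open index ranges into the rFFT bins.
    """
    num_tokens = tokens.shape[0]
    # Real-input FFT over the token sequence, independently per embedding dimension.
    spectrum = np.fft.rfft(tokens, axis=0)  # shape: (num_tokens // 2 + 1, dim)
    keep = np.ones(spectrum.shape[0], dtype=bool)
    for lo, hi in stop_bands:
        keep[lo:hi] = False  # suppress this band
    spectrum = spectrum * keep[:, None]
    # Back to the token domain with the original sequence length.
    return np.fft.irfft(spectrum, n=num_tokens, axis=0)


# Stand-in for a 7x7 grid of 512-dimensional patch embeddings (assumed sizes).
rng = np.random.default_rng(0)
patch_embeddings = rng.standard_normal((49, 512))
filtered = spectral_band_stop(patch_embeddings, stop_bands=[(3, 8)])
```

In a style-transfer or generation loop, the filtered embeddings would stand in for the raw ones when computing the CLIP-based loss. Which bands to suppress is an empirical question that the abstract does not specify.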