Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in robust image classification models, wherein their input gradients align with human perception and carry semantic meaning. While this phenomenon has gained significant research attention, it has been studied only in the context of unimodal, vision-only architectures. In this work, we extend the study of PAG to Vision-Language architectures, which form the foundation for diverse image-text tasks and applications. Through adversarially robust finetuning of CLIP, we demonstrate that robust Vision-Language models exhibit PAG, in contrast to their vanilla counterparts. This work reveals the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, we show that seamlessly integrating CLIPAG in a "plug-n-play" manner leads to substantial improvements in vision-language generative applications. Furthermore, leveraging its PAG property, CLIPAG enables text-to-image generation without any generative model, a task that typically requires huge generators.