Perceptually Aligned Gradients (PAG) refer to an intriguing property observed in robust image classification models, wherein their input gradients align with human perception and carry semantic meaning. While this phenomenon has gained significant research attention, it has so far been studied only in the context of unimodal, vision-only architectures. In this work, we extend the study of PAG to vision-language architectures, which form the foundation for diverse image-text tasks and applications. By adversarially fine-tuning CLIP for robustness, we demonstrate that robust vision-language models exhibit PAG, whereas their vanilla counterparts do not. We then show the merits of CLIP with PAG (CLIPAG) in several vision-language generative tasks. Notably, integrating CLIPAG in a "plug-n-play" manner leads to substantial improvements in vision-language generative applications. Furthermore, by leveraging its PAG property, CLIPAG enables text-to-image generation without any generative model, a task that typically requires huge generators.
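To make the notion of an input gradient concrete, the following is a minimal sketch in our own notation, not the exact formulation used in this work. We assume an image encoder $f_I$, a text encoder $f_T$, and the cosine similarity that CLIP uses to score an image-text pair; the step size $\eta$, the iteration index $k$, and the projection $\Pi_{[0,1]}$ onto valid pixel values are illustrative assumptions.

```latex
% Image-text score (cosine similarity between CLIP embeddings)
s(x, t) = \frac{\langle f_I(x),\, f_T(t) \rangle}{\lVert f_I(x) \rVert \,\lVert f_T(t) \rVert}

% Input gradient with respect to the image x for a fixed caption t.
% A model has PAG when g(x, t) is perceptually meaningful rather than noise-like.
g(x, t) = \nabla_x \, s(x, t)

% Hypothetical generator-free text-to-image update: iterative gradient ascent
% on the pixels, starting from some initial image x_0 (assumed, e.g. noise).
x_{k+1} = \Pi_{[0,1]}\!\left( x_k + \eta \, \nabla_x \, s(x_k, t) \right)
```

Under this reading, a vanilla model would tend to drive $x_k$ toward adversarial-looking noise, whereas a model with PAG would move it toward an image that a human recognizes as matching the caption $t$.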