Recently, textual prompt tuning has shown inspiring performance in adapting Contrastive Language-Image Pre-training (CLIP) models to natural image quality assessment. However, such a uni-modal prompt learning method tunes only the language branch of CLIP models. This is insufficient for adapting CLIP models to AI-generated image quality assessment (AGIQA), since AI-generated images (AGIs) visually differ from natural images. In addition, the consistency between AGIs and user-input text prompts, which correlates with the perceptual quality of AGIs, has not been investigated as guidance for AGIQA. In this letter, we propose vision-language consistency guided multi-modal prompt learning for blind AGIQA, dubbed CLIP-AGIQA. Specifically, we introduce learnable textual and visual prompts into the language and vision branches of CLIP models, respectively. Moreover, we design a text-to-image alignment quality prediction task, whose learned vision-language consistency knowledge is used to guide the optimization of the above multi-modal prompts. Experimental results on two public AGIQA datasets demonstrate that the proposed method outperforms state-of-the-art quality assessment models. The source code is available at https://github.com/JunFu1995/CLIP-AGIQA.