Multi-Modal Prompt Learning on Blind Image Quality Assessment

Image Quality Assessment (IQA) models benefit significantly from semantic information, which allows them to treat different types of objects distinctly. Leveraging semantic information to enhance IQA is therefore an important research direction. Because sufficiently annotated data is scarce, earlier methods have adopted the CLIP image-text pre-trained model as their backbone to gain semantic awareness. However, the generalist nature of such pre-trained Vision-Language (VL) models often makes them suboptimal for IQA-specific tasks. Recent approaches have tried to close this gap with prompt learning, but they have a shortcoming: existing prompt-based VL models focus heavily on incremental semantic information from text and neglect the information available in the visual data itself. This imbalance limits their gains on IQA tasks. This paper introduces a multi-modal prompt-based method for IQA. Our approach uses carefully designed prompts that jointly mine incremental semantic information from both visual and linguistic data. Specifically, in the visual branch we introduce a multi-layer prompt structure to improve the VL model's adaptability. In the text branch we deploy a dual-prompt scheme that steers the model to recognize and distinguish between scene category and distortion type, thereby refining its capacity to assess image quality. Our experiments show that the method improves on existing Blind Image Quality Assessment (BIQA) approaches and is competitive across a range of datasets. It achieves Spearman Rank Correlation Coefficient (SRCC) values of 0.961 on CSIQ (surpassing 0.946) and 0.941 on KADID (exceeding 0.930), illustrating its robustness and accuracy in diverse contexts.
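For readers unfamiliar with the metric quoted above, the following is a minimal sketch of how SRCC can be computed between a model's predicted quality scores and human mean opinion scores (MOS). It is not taken from the paper: the function names `average_ranks` and `srcc` and the example scores are illustrative assumptions, and the implementation simply applies the standard definition (Pearson correlation of the rank vectors, with tied values sharing an average rank).

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values equal to values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def srcc(predicted, mos):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    if len(predicted) != len(mos) or len(predicted) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rp, rm = average_ranks(predicted), average_ranks(mos)
    n = len(rp)
    mean_p, mean_m = sum(rp) / n, sum(rm) / n
    cov = sum((a - mean_p) * (b - mean_m) for a, b in zip(rp, rm))
    var_p = sum((a - mean_p) ** 2 for a in rp)
    var_m = sum((b - mean_m) ** 2 for b in rm)
    if var_p == 0 or var_m == 0:
        raise ValueError("SRCC is undefined when one input is constant")
    return cov / (var_p * var_m) ** 0.5


if __name__ == "__main__":
    # Hypothetical scores for four images (not data from the paper).
    predicted_quality = [0.21, 0.48, 0.67, 0.90]
    human_mos = [1.8, 2.9, 4.4, 4.1]
    print(f"SRCC = {srcc(predicted_quality, human_mos):.3f}")
```

Because SRCC depends only on rank order, it rewards a model for ordering images by quality the same way human raters do, regardless of the scale of its raw scores. A value close to 1, such as those reported above, means the predicted ordering agrees almost entirely with the human ordering.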

