Large Multimodal Models (LMMs) have recently shown remarkable promise in low-level visual perception tasks, particularly Image Quality Assessment (IQA), where they demonstrate strong zero-shot capability. However, achieving state-of-the-art performance often requires computationally expensive fine-tuning, which aims to align the distribution of quality-related tokens in the output with image quality levels. Inspired by recent training-free work on LMMs, we introduce IQARAG, a novel training-free framework that enhances the IQA ability of LMMs. IQARAG leverages Retrieval-Augmented Generation (RAG) to retrieve, for each input image, several semantically similar but quality-variant reference images together with their Mean Opinion Scores (MOSs). The retrieved images and the input image are then integrated into a dedicated prompt, so that the retrieved images give the LMM a visual perception anchor for the IQA task. IQARAG comprises three key phases: Retrieval Feature Extraction, Image Retrieval, and Integration & Quality Score Generation. Extensive experiments on multiple diverse IQA datasets, including KADID, KonIQ, LIVE Challenge, and SPAQ, demonstrate that IQARAG effectively boosts the IQA performance of LMMs, offering a resource-efficient alternative to fine-tuning for quality assessment.
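The three phases named in the abstract can be illustrated with a minimal, standard-library-only sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the feature vectors are toy stand-ins for the output of a retrieval feature extractor, retrieval is plain top-k cosine similarity (the abstract does not say how quality variation among the references is enforced), the prompt wording is hypothetical, and the final LMM call is omitted because no specific model API is given.

```python
import math
from typing import List, Sequence, Tuple

# (image id, feature vector, MOS). All three fields are toy placeholders.
Reference = Tuple[str, List[float], float]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_feat: Sequence[float], pool: List[Reference], k: int = 2) -> List[Reference]:
    """Phase 2 (assumed form): return the k references most similar to the query."""
    ranked = sorted(
        pool,
        key=lambda ref: cosine_similarity(query_feat, ref[1]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query_id: str, refs: List[Reference]) -> str:
    """Phase 3 (assumed wording): pair each reference with its MOS, then ask about the query."""
    lines = [
        f"Reference image <{rid}> has quality score {mos:.2f}."
        for rid, _, mos in refs
    ]
    lines.append(f"Rate the quality of image <{query_id}> on the same scale.")
    return "\n".join(lines)


# Phase 1 is stubbed: these vectors stand in for extracted retrieval features.
pool: List[Reference] = [
    ("ref_a", [1.0, 0.0, 0.0], 4.2),
    ("ref_b", [0.9, 0.1, 0.0], 1.8),
    ("ref_c", [0.0, 1.0, 0.0], 3.1),
]
query_feat = [1.0, 0.05, 0.0]

refs = retrieve(query_feat, pool, k=2)
prompt = build_prompt("query", refs)
print(prompt)
```

In this toy pool the two retrieved references are close to the query in feature space but carry very different MOS values, which is the kind of "semantically similar but quality-variant" anchor the abstract describes. In a real pipeline the image placeholders in the prompt would be replaced by the actual images passed to the LMM.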