Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets.
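The pseudo-labeling and reward steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the fidelity reward is written here as the standard fidelity measure between a predicted preference probability and the pseudo-label, which is one plausible reading of the "fidelity reward" the abstract mentions.

```python
from collections import Counter

def majority_vote_label(vlm_comparisons):
    """Pseudo-label for an image pair: the relative-quality verdict
    ('A' or 'B') that most of the VLM's own sampled pairwise
    comparisons agree on (self-consistency via majority voting)."""
    counts = Counter(vlm_comparisons)
    label, _ = counts.most_common(1)[0]
    return label

def fidelity_reward(p_pred, p_pseudo):
    """Fidelity between the policy's predicted probability that image A
    is better (p_pred) and the pseudo-label probability (p_pseudo = 1.0
    when 'A' won the vote, 0.0 otherwise). Maximal (1.0) when the two
    distributions agree exactly."""
    return (p_pred * p_pseudo) ** 0.5 + ((1 - p_pred) * (1 - p_pseudo)) ** 0.5
```

In a GRPO loop, each sampled response in a group would be scored with this reward against the voted pseudo-ranking, and the group-relative advantages would drive the policy update.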