Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets.
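The pseudo-labeling and reward steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and the fidelity reward is written here as the standard fidelity measure between a predicted preference probability and the pseudo-label, which is one plausible reading of the "fidelity reward" the abstract mentions.

```python
from collections import Counter

def majority_vote_label(vlm_comparisons):
    """Pseudo-label for an image pair: the relative-quality verdict
    ('A' or 'B') that most of the VLM's own sampled pairwise
    comparisons agree on (self-consistency via majority voting)."""
    counts = Counter(vlm_comparisons)
    label, _ = counts.most_common(1)[0]
    return label

def fidelity_reward(p_pred, p_pseudo):
    """Fidelity between the policy's predicted probability that image A
    is better (p_pred) and the pseudo-label probability (p_pseudo = 1.0
    when 'A' won the vote, 0.0 otherwise). Maximal (1.0) when the two
    distributions agree exactly."""
    return (p_pred * p_pseudo) ** 0.5 + ((1 - p_pred) * (1 - p_pseudo)) ** 0.5
```

In a GRPO loop, each sampled response in a group would be scored with this reward against the voted pseudo-ranking, and the group-relative advantages would drive the policy update.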