Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Code and checkpoints will be available at https://github.com/bytedance/EvoQuality.
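As a rough illustration of the voting-and-reward idea described above, the following Python sketch derives pairwise pseudo-labels by majority vote over a model's own sampled quality scores and then scores a predicted preference probability against a pseudo-label. It is a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify how samples are compared or how the fidelity reward is defined. The all-pairs comparison of samples, the function names `pairwise_majority_vote` and `fidelity_reward`, and the use of the standard fidelity measure sqrt(p*q) + sqrt((1-p)*(1-q)) are all assumptions made for illustration.

```python
import math
from itertools import combinations


def pairwise_majority_vote(samples):
    """Build pairwise pseudo-labels from a model's own sampled scores.

    samples: dict mapping an image id to a list of quality scores sampled
             from the same model (hypothetical input format).
    Returns a dict mapping (a, b) to 1.0 if a is voted better than b,
    0.0 if b is voted better, and 0.5 on a tie.
    """
    labels = {}
    for a, b in combinations(sorted(samples), 2):
        # Assumption: every sample of a is compared with every sample of b.
        wins_a = sum(sa > sb for sa in samples[a] for sb in samples[b])
        wins_b = sum(sb > sa for sa in samples[a] for sb in samples[b])
        if wins_a > wins_b:
            labels[(a, b)] = 1.0
        elif wins_b > wins_a:
            labels[(a, b)] = 0.0
        else:
            labels[(a, b)] = 0.5
    return labels


def fidelity_reward(p_pred, p_pseudo):
    """Fidelity between a predicted preference probability and a pseudo-label.

    Assumption: the standard fidelity measure, which lies in [0, 1] and
    equals 1 when the two probabilities agree.
    """
    return math.sqrt(p_pred * p_pseudo) + math.sqrt((1 - p_pred) * (1 - p_pseudo))


# Hypothetical scores: three sampled responses per image.
samples = {"img_a": [4.0, 3.5, 4.2], "img_b": [2.0, 3.8, 2.5]}
pseudo_labels = pairwise_majority_vote(samples)
```

In a GRPO-style loop, a reward of this kind would be computed per sampled response and normalized within each group; that step is omitted here because the abstract gives no details on it.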

