Image Quality Assessment (IQA) aims to predict perceptual quality scores consistent with human judgments. Recent RL-based IQA methods built on MLLMs focus on generating visual quality descriptions and scores while ignoring two key reliability limitations: (i) although the model's prediction stability varies significantly across training samples, existing GRPO-based methods apply uniform advantage weighting, thereby amplifying noisy signals from unstable samples during gradient updates; (ii) most works emphasize text-grounded reasoning over images while overlooking the model's ability to visually perceive image content. In this paper, we propose Q-Hawkeye, an RL-based reliable visual policy optimization framework that redesigns the learning signal through unified Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization. Q-Hawkeye estimates predictive uncertainty as the variance of predicted scores across multiple rollouts and uses this uncertainty to reweight each sample's update strength, stabilizing policy optimization. To strengthen perceptual reliability, we construct paired inputs of degraded images and their originals and introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence. Extensive experiments demonstrate that Q-Hawkeye outperforms state-of-the-art methods and generalizes better across multiple datasets. The code and models will be made available.
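The uncertainty-aware reweighting described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the reciprocal weighting form `1 / (1 + alpha * variance)`, and the `alpha` hyperparameter are assumptions; only the core idea (normalize rewards within a rollout group GRPO-style, then shrink the advantages of samples whose predicted scores vary widely across rollouts) follows the abstract.

```python
import numpy as np

def uncertainty_weighted_advantages(group_scores, group_rewards, alpha=1.0):
    """Illustrative sketch of uncertainty-aware advantage reweighting.

    group_scores  : quality scores predicted by G rollouts of one sample.
    group_rewards : scalar rewards for those rollouts.
    alpha         : assumed hyperparameter controlling the down-weighting.
    """
    scores = np.asarray(group_scores, dtype=float)
    rewards = np.asarray(group_rewards, dtype=float)

    # Standard GRPO-style step: normalize rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Predictive uncertainty: variance of predicted scores across rollouts.
    uncertainty = scores.var()

    # Down-weight unstable samples so noisy groups yield smaller updates.
    weight = 1.0 / (1.0 + alpha * uncertainty)
    return weight * adv
```

A sample whose rollouts agree on the score keeps nearly full update strength, while one with widely scattered predictions contributes a correspondingly damped gradient signal.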
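The Implicit Perception Loss on degraded/original pairs can also be sketched. This is an assumption-laden illustration, not the paper's actual loss: the hinge form, the `margin` parameter, and the function name are all hypothetical, chosen only to show one way a model can be penalized for failing to score a pristine image above its degraded counterpart.

```python
def implicit_perception_loss(score_orig, score_degraded, margin=0.5):
    """Hypothetical hinge loss over a (original, degraded) image pair.

    The penalty is zero once the original is scored at least `margin`
    above its degraded version, tying the quality judgment to the
    actual visual difference between the two inputs.
    """
    return max(0.0, margin - (score_orig - score_degraded))
```

For example, a well-ordered pair such as `(4.0, 2.0)` incurs no penalty, while an inverted pair such as `(2.0, 4.0)` is penalized in proportion to the violation.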