Image Quality Assessment (IQA) aims to predict perceptual quality scores consistent with human judgments. Recent RL-based IQA methods built on multimodal large language models (MLLMs) focus on generating visual quality descriptions and scores, but overlook two key reliability limitations: (i) although a model's prediction stability varies significantly across training samples, existing GRPO-based methods apply uniform advantage weighting, thereby amplifying noisy signals from unstable samples during gradient updates; (ii) most works emphasize text-grounded reasoning over images while overlooking the model's ability to visually perceive image content. In this paper, we propose Q-Hawkeye, an RL-based reliable visual policy optimization framework that redesigns the learning signal through unified Uncertainty-Aware Dynamic Optimization and Perception-Aware Optimization. Q-Hawkeye estimates predictive uncertainty as the variance of predicted scores across multiple rollouts and uses this uncertainty to reweight each sample's update strength, stabilizing policy optimization. To strengthen perceptual reliability, we construct paired inputs of degraded images and their originals and introduce an Implicit Perception Loss that constrains the model to ground its quality judgments in genuine visual evidence. Extensive experiments demonstrate that Q-Hawkeye outperforms state-of-the-art methods and generalizes better across multiple datasets. Our dataset and code are available at https://github.com/AMAP-ML/Q-Hawkeye.
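The uncertainty-aware reweighting idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the inverse-variance weighting function `1 / (1 + var)` and the helper names `uncertainty_weight` and `reweighted_advantages` are assumptions chosen for clarity, since the abstract only states that the variance of scores across rollouts modulates each sample's update strength.

```python
import statistics


def uncertainty_weight(rollout_scores, eps=1e-6):
    """Estimate predictive uncertainty as the variance of the scores
    a policy predicts for the same image across multiple rollouts,
    then map it to a weight in (0, 1]: stable samples keep full
    update strength, unstable (high-variance) samples are damped.

    NOTE: the inverse-variance form below is an illustrative choice,
    not necessarily the weighting used by Q-Hawkeye.
    """
    var = statistics.pvariance(rollout_scores)
    return 1.0 / (1.0 + var + eps * 0.0)  # eps kept for numerical-safety variants


def reweighted_advantages(rollout_scores, advantages):
    """Scale a sample's per-rollout GRPO advantages by its
    uncertainty-derived weight before the policy-gradient update."""
    w = uncertainty_weight(rollout_scores)
    return [w * a for a in advantages]
```

For example, a sample whose rollouts all predict the same score keeps its advantages unchanged (weight 1.0), while a sample whose predicted scores scatter widely contributes a correspondingly down-weighted gradient signal.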