We study full-reference image quality assessment from a machine-centric perspective, where images are evaluated by how well they preserve information for downstream models. We formulate machine-oriented quality as a latent machine utility and approximate it through pairwise predictive-consistency comparisons. To this end, we construct PCMP, a dataset of PSNR-matched distortion pairs labeled by consistency votes from multiple pretrained models. We further propose ML-CLIPSim, a differentiable quality metric built on a frozen CLIP visual encoder, which aggregates intermediate patch-token similarities and global image embeddings. Experiments on machine-preference benchmarks, human-IQA datasets, and learned image compression show that ML-CLIPSim better aligns with machine-oriented preferences than conventional fidelity and perceptual metrics, while remaining competitive for human quality prediction. Used as a compression distortion term, it improves rate--task trade-offs across multiple downstream tasks.
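To make the two ingredients above concrete, the following is a minimal sketch, not the authors' implementation. It assumes that a consistency vote counts, per model, whether a distorted image keeps the prediction made on the reference image, and that the metric averages patch-token cosine similarities over layers before mixing them with a global-embedding cosine similarity. The function names, the `w_global` weight, and the toy feature lists (stand-ins for CLIP features) are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consistency_vote(ref_preds, preds_a, preds_b):
    """Label a distortion pair (A, B) from per-model predictions.

    Each argument holds one prediction per model. A model votes for a
    distorted image when its prediction matches the one on the reference.
    Returns "A", "B", or "tie". (Assumed voting rule, for illustration.)
    """
    votes_a = sum(1 for r, a in zip(ref_preds, preds_a) if a == r)
    votes_b = sum(1 for r, b in zip(ref_preds, preds_b) if b == r)
    if votes_a == votes_b:
        return "tie"
    return "A" if votes_a > votes_b else "B"


def aggregate_similarity(ref_layers, dist_layers, ref_global, dist_global,
                         w_global=0.5):
    """Mix layer-averaged patch-token similarity with global similarity.

    ref_layers / dist_layers: one list of patch-token vectors per layer.
    w_global is a hypothetical mixing weight, not a value from the paper.
    """
    layer_scores = []
    for ref_tokens, dist_tokens in zip(ref_layers, dist_layers):
        sims = [cosine(r, d) for r, d in zip(ref_tokens, dist_tokens)]
        layer_scores.append(sum(sims) / len(sims))
    patch_score = sum(layer_scores) / len(layer_scores)
    global_score = cosine(ref_global, dist_global)
    return (1.0 - w_global) * patch_score + w_global * global_score


# Toy usage: three models, two candidate distortions of one reference image.
label = consistency_vote(["cat", "cat", "dog"],
                         ["cat", "cat", "cat"],    # A keeps 2 of 3 predictions
                         ["cat", "bird", "cat"])   # B keeps 1 of 3 predictions

# Toy features: one layer with two patch tokens, plus a global embedding.
tokens = [[[1.0, 0.0], [0.0, 1.0]]]
score_same = aggregate_similarity(tokens, tokens, [1.0, 0.0], [1.0, 0.0])
score_shifted = aggregate_similarity(tokens, tokens, [1.0, 0.0], [0.0, 1.0])
```

In this toy run, `label` favors distortion A, and `score_shifted` falls below `score_same` because only the global term changes. A differentiable version would replace the list arithmetic with tensor operations on features from the frozen encoder.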