Psychophysical experiments remain the most reliable approach for perceptual image quality assessment (IQA), yet their cost and limited scalability motivate automated alternatives. We investigate whether Vision-Language Models (VLMs) can approximate human perceptual judgments across three image quality scales: contrast, colorfulness, and overall preference. Six VLMs (four proprietary and two open-weight models) are systematically benchmarked against human psychophysical data. The results reveal strong attribute-dependent variability: models with high human alignment for colorfulness (Spearman's ρ up to 0.93) underperform on contrast, and vice versa. Attribute weighting analysis further shows that most VLMs assign higher weights to colorfulness than to contrast when evaluating overall preference, consistent with the psychophysical data. Intra-model consistency analysis reveals a counterintuitive trade-off: the most self-consistent models are not necessarily the most human-aligned, suggesting that response variability reflects sensitivity to scene-dependent perceptual cues. Furthermore, human-VLM agreement increases with perceptual separability, indicating that VLMs are more reliable when stimulus differences are clearly expressed.
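The human alignment figures quoted above are rank correlations between model outputs and psychophysical scale values. As a minimal sketch of how such a figure could be computed, the snippet below implements Spearman's ρ with average ranks for ties, using only the Python standard library. The function names and the sample numbers are hypothetical illustrations; they are not the paper's code or data, and the paper's actual analysis pipeline may differ.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with values[order[i]].
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed inputs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one input is constant")
    return cov / (var_x * var_y) ** 0.5


# Hypothetical values for illustration only (not taken from the paper):
# one psychophysical scale value and one VLM score per stimulus.
human_scale = [0.12, 0.40, 0.35, 0.81, 0.93]
vlm_scores = [2.0, 3.0, 4.0, 5.0, 5.0]
rho = spearman_rho(human_scale, vlm_scores)
print(f"Spearman rho between human and VLM rankings: {rho:.2f}")
```

Because the statistic depends only on rank order, it does not require the VLM scores and the psychophysical scale to share units, which is one reason a rank correlation suits this kind of comparison.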