Analysis of Deep Image Quality Models

Subjective image quality measures based on deep neural networks are closely related to models from visual neuroscience. This connection benefits engineering but, more interestingly, the freedom to optimize deep networks in different ways makes them an excellent tool to explore the principles behind visual perception (both human and artificial). Recently, a myriad of networks have been successfully optimized for many interesting visual tasks. Although these nets were not specifically designed to predict image quality or other psychophysics, they have shown surprisingly human-like behavior. The reasons for this remain unclear. In this work, we perform a thorough analysis of the perceptual properties of pre-trained nets (particularly their ability to predict image quality) by isolating different factors: the goal (the function), the data (learning environment), the architecture, and the readout: selected layer(s), fine-tuning of channel relevance, and use of statistical descriptors as opposed to plain readout of responses. Several conclusions can be drawn. All the models correlate better with human opinion than SSIM. More importantly, some of the nets are on par with the state of the art with no extra refinement or perceptual information. Nets trained for supervised tasks such as classification correlate substantially better with humans than LPIPS (a net specifically tuned for image quality). Interestingly, self-supervised tasks such as jigsaw also perform better than LPIPS. Simpler architectures are better than very deep nets. In simpler nets, correlation with humans increases with depth, as if deeper layers were closer to human judgement. This is not true in very deep nets. Consistent with reports on illusions and contrast sensitivity, small changes in the image environment do not make a big difference. Finally, the explored statistical descriptors and concatenations had no major impact.
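
As an illustration of the plain readout evaluated above, the following minimal sketch computes a distance between the responses of one layer to a reference image and to its distorted versions, and then correlates those distances with human ratings. It is a sketch under stated assumptions, not the procedure used in this work: the Euclidean distance, the Spearman rank correlation, the function names, and the toy feature vectors and ratings are all hypothetical choices made for illustration. In practice the feature vectors would be the flattened responses of a selected layer of a pre-trained net.

```python
import math

def feature_distance(ref_feats, dist_feats):
    """Euclidean distance between two flattened feature vectors (plain readout)."""
    return math.sqrt(sum((r - d) ** 2 for r, d in zip(ref_feats, dist_feats)))

def ranks(values):
    """1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical toy data: layer responses to one reference image and to
# three distorted versions of it (assumed values, not real network output).
reference = [0.0, 1.0, 2.0]
distorted = [
    [0.1, 1.0, 2.0],
    [0.0, 1.5, 2.0],
    [1.0, 2.0, 3.0],
]
# Hypothetical human ratings of visible distortion (higher = more distorted).
human_ratings = [1.0, 2.5, 4.0]

distances = [feature_distance(reference, d) for d in distorted]
rho = spearman(distances, human_ratings)
print(distances, rho)
```

Rank correlation is used in this sketch because it only asks whether the model orders the distortions as observers do, without assuming a linear relation between feature distance and rating. If the ratings were quality scores (higher = better), a good model would instead yield a strongly negative correlation.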
