Blind Image Quality Assessment via Transformer Predicted Error Map and Perceptual Quality Token

Image quality assessment is a fundamental problem in image processing. Because reference images are unavailable in most practical scenarios, no-reference image quality assessment (NR-IQA) has gained increasing attention. With the development of deep learning, many NR-IQA methods based on deep neural networks have been proposed, which try to learn image quality from the information contained in annotated databases. Recently, the Transformer has achieved remarkable progress in various vision tasks. Because the attention mechanism of the Transformer matches the global perceptual impact that artifacts have on a human viewer, the Transformer is well suited to image quality assessment. In this paper, we propose a Transformer-based NR-IQA model that uses a predicted objective error map and a perceptual quality token. Specifically, we first generate the predicted error map by pre-training a model consisting of a Transformer encoder and decoder, with the objective difference between the distorted and reference images used as supervision. We then freeze the parameters of the pre-trained model and design a second branch, based on the vision Transformer, that extracts the perceptual quality token for feature fusion with the predicted error map. Finally, the fused features are regressed to the final image quality score. Extensive experiments show that the proposed method outperforms the current state of the art on both authentic and synthetic image databases. Moreover, the attention map extracted by the perceptual quality token conforms to the characteristics of the human visual system.
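To make the pre-training supervision concrete, the following is a minimal sketch of how an objective error map could be computed as a training target. It assumes the "objective difference" is a pixel-wise absolute difference between grayscale images with values in [0, 1], and that the prediction is scored with a mean absolute error; both choices, and the names `error_map` and `l1_loss`, are illustrative assumptions rather than the definitions used in this paper.

```python
from typing import List

# Assumption: a grayscale image is a list of rows with pixel values in [0, 1].
Image = List[List[float]]


def error_map(distorted: Image, reference: Image) -> Image:
    """Pixel-wise absolute difference (assumed form of the objective error)."""
    same_height = len(distorted) == len(reference)
    same_width = all(len(d) == len(r) for d, r in zip(distorted, reference))
    if not (same_height and same_width):
        raise ValueError("distorted and reference images must have the same shape")
    return [
        [abs(d - r) for d, r in zip(d_row, r_row)]
        for d_row, r_row in zip(distorted, reference)
    ]


def l1_loss(predicted: Image, target: Image) -> float:
    """Mean absolute error between a predicted error map and its target."""
    diffs = [
        abs(p - t)
        for p_row, t_row in zip(predicted, target)
        for p, t in zip(p_row, t_row)
    ]
    if not diffs:
        raise ValueError("images must not be empty")
    return sum(diffs) / len(diffs)


if __name__ == "__main__":
    distorted = [[0.5, 1.0], [0.0, 0.25]]
    reference = [[0.5, 0.0], [0.25, 0.25]]
    target = error_map(distorted, reference)
    # A hypothetical model that predicts "no error everywhere".
    blank_prediction = [[0.0, 0.0], [0.0, 0.0]]
    print(target)
    print(l1_loss(blank_prediction, target))
```

In this sketch the reference image is needed only to build the target during pre-training; at inference time a frozen model would receive the distorted image alone, which is what makes the overall pipeline no-reference.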
