With the increasing maturity of text-to-image and image-to-image generative models, AI-generated images (AGIs) have shown great application potential in advertising, entertainment, education, social media, etc. Although remarkable advances have been achieved in generative models, little effort has been devoted to designing corresponding quality assessment models. In this paper, we propose a novel blind image quality assessment (IQA) network, named AMFF-Net, for AGIs. AMFF-Net evaluates AGI quality from three dimensions, i.e., "visual quality", "authenticity", and "consistency". Specifically, inspired by the characteristics of the human visual system and motivated by the observation that "visual quality" and "authenticity" are characterized by both local and global aspects, AMFF-Net scales the image up and down and takes the scaled images and the original-sized image as inputs to obtain multi-scale features. After that, an Adaptive Feature Fusion (AFF) block is used to adaptively fuse the multi-scale features with learnable weights. In addition, considering the correlation between the image and its prompt, AMFF-Net compares the semantic features from the text encoder and the image encoder to evaluate the text-to-image alignment. We carry out extensive experiments on three AGI quality assessment databases, and the results show that AMFF-Net outperforms nine state-of-the-art blind IQA methods. Ablation experiments further demonstrate the effectiveness of the proposed multi-scale input strategy and AFF block.
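The abstract does not specify the exact form of the AFF block or the alignment measure, so the following is only a minimal illustrative sketch. It assumes the learnable per-scale weights are softmax-normalized before fusing the multi-scale feature vectors, and that text-to-image alignment is scored as the cosine similarity between the text-encoder and image-encoder features; all function names are hypothetical.

```python
import math

def adaptive_feature_fusion(features, weights):
    """Hypothetical sketch of an AFF-style block: fuse multi-scale feature
    vectors with softmax-normalized learnable scalar weights (assumed form).

    features: list of equal-length feature vectors, one per scale
    weights:  list of learnable scalars, one per scale
    """
    exps = [math.exp(w) for w in weights]
    total = sum(exps)
    alphas = [e / total for e in exps]  # softmax over scales
    dim = len(features[0])
    # Weighted sum of the per-scale features, element by element
    return [sum(a * f[i] for a, f in zip(alphas, features)) for i in range(dim)]

def cosine_alignment(text_feat, image_feat):
    """Assumed text-to-image alignment score: cosine similarity between the
    semantic features produced by the text encoder and the image encoder."""
    dot = sum(t * v for t, v in zip(text_feat, image_feat))
    norm_t = math.sqrt(sum(t * t for t in text_feat))
    norm_v = math.sqrt(sum(v * v for v in image_feat))
    return dot / (norm_t * norm_v)
```

With equal weights the fusion reduces to a plain average of the scales; training would adjust the weights so that more informative scales dominate.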