Vision transformers (ViTs) excel in computer vision at modeling long-range dependencies, yet face two key challenges in image quality assessment (IQA): discarding fine-grained details during patch embedding, and requiring extensive training data due to their lack of inductive biases. In this study, we propose a Global-Local progressive INTegration network for IQA, called GlintIQA, which addresses these issues through three key components: 1) Hybrid feature extraction combines a ViT-based global feature extractor (VGFE) and a convolutional neural network (CNN)-based local feature extractor (CLFE) to capture global coarse-grained and local fine-grained features, respectively. Incorporating CNNs mitigates the patch-level information loss and inductive-bias constraints inherent to ViT architectures. 2) Progressive feature integration employs diverse kernel sizes during embedding to spatially align the coarse- and fine-grained features, then progressively aggregates them by interleaving channel-wise attention and spatial enhancement modules to build effective quality-aware representations. 3) A content-similarity-based labeling approach automatically assigns quality labels to images with diverse content based on subjective quality scores. This alleviates the scarcity of labeled training data in synthetic datasets and bolsters model generalization. Experimental results demonstrate the efficacy of our approach, yielding a 5.04% average SROCC gain in cross-authentic-dataset evaluations. Moreover, our model and its counterpart pre-trained on the proposed dataset exhibit 5.40% and 13.23% improvements, respectively, in cross-synthetic-dataset evaluations. The code and proposed dataset will be released at https://github.com/XiaoqiWang/GlintIQA.
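The progressive feature integration described above can be illustrated with a minimal NumPy sketch: spatially aligned global and local feature maps are fused, then passed through interleaved channel-wise attention (squeeze-and-excitation-style gating here, as an assumption) and spatial enhancement stages. All function names, the gating form, and the two-stage stacking are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat, w1, w2):
    """Reweight channels of feat (C, H, W) by a gate computed from
    globally pooled channel statistics (SE-style; an assumption)."""
    pooled = feat.mean(axis=(1, 2))                 # (C,) squeeze
    gate = sigmoid(w2 @ np.tanh(w1 @ pooled))       # (C,) excitation
    return feat * gate[:, None, None]

def spatial_enhancement(feat):
    """Emphasize spatial locations via a channel-pooled attention map
    (a simple placeholder for the paper's spatial enhancement module)."""
    attn = sigmoid(feat.mean(axis=0))               # (H, W)
    return feat * attn[None, :, :]

def progressive_integration(global_feat, local_feat, w1, w2, stages=2):
    """Fuse aligned coarse (global) and fine (local) features, then
    interleave channel attention and spatial enhancement `stages` times."""
    fused = global_feat + local_feat                # assumes same (C, H, W)
    for _ in range(stages):
        fused = channel_attention(fused, w1, w2)
        fused = spatial_enhancement(fused)
    return fused
```

Because each stage only rescales the fused tensor, the output keeps the input's (C, H, W) shape, so stages can be stacked freely.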