Image Quality Assessment (IQA) is a fundamental task in computer vision, yet it remains unsolved owing to intricate distortion conditions, diverse image content, and limited data. Recently, numerous large-scale pretrained foundation models have emerged, benefiting greatly from dramatically increased data and parameter capacity. However, it remains an open question whether the scaling law observed in high-level tasks also applies to the IQA task, which is closely tied to low-level cues. In this paper, we demonstrate that, with proper injection of local distortion features, a larger pretrained and frozen foundation model performs better on IQA tasks. Specifically, because the vision transformer (ViT) lacks local distortion structure and inductive bias, we pair the large-scale pretrained ViT with a pretrained convolutional neural network (CNN), which is well known for capturing local structure, to extract multi-scale image features. Further, we propose a local distortion extractor that obtains local distortion features from the pretrained CNN and a local distortion injector that injects these features into the ViT. By training only the extractor and injector, our method benefits from the rich knowledge in powerful foundation models and achieves state-of-the-art performance on popular IQA datasets, indicating that IQA is not only a low-level problem but also benefits from stronger high-level features drawn from large-scale pretrained models.
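The abstract does not specify how the extractor and injector are built, so the following is only a minimal NumPy sketch of one plausible reading. It assumes that the extractor linearly projects multi-scale CNN feature maps to the ViT token width and flattens them into tokens, and that the injector adds them to the ViT tokens through single-head cross-attention with a zero-initialized gate. The class names `LocalDistortionExtractor` and `LocalDistortionInjector`, the cross-attention design, the gate `gamma`, and all shapes are assumptions made for illustration, not details taken from the paper. The random arrays stand in for the outputs of the frozen ViT and CNN backbones.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


class LocalDistortionExtractor:
    """Hypothetical extractor: maps multi-scale CNN feature maps to tokens."""

    def __init__(self, in_channels, dim, rng):
        # One linear projection per CNN scale (assumed design).
        self.proj = [rng.normal(0.0, c ** -0.5, (c, dim)) for c in in_channels]

    def __call__(self, feature_maps):
        # Each feature map has shape (C_i, H_i, W_i); flatten the spatial
        # grid and project channels to the ViT token width.
        tokens = [fm.reshape(fm.shape[0], -1).T @ w
                  for fm, w in zip(feature_maps, self.proj)]
        return np.concatenate(tokens, axis=0)  # (sum_i H_i * W_i, dim)


class LocalDistortionInjector:
    """Hypothetical injector: ViT tokens attend to local distortion tokens."""

    def __init__(self, dim, rng):
        scale = dim ** -0.5
        self.w_q = rng.normal(0.0, scale, (dim, dim))
        self.w_k = rng.normal(0.0, scale, (dim, dim))
        self.w_v = rng.normal(0.0, scale, (dim, dim))
        # Assumption: a zero-initialized gate, so the frozen ViT's tokens
        # pass through unchanged before any training has happened.
        self.gamma = np.zeros(dim)

    def __call__(self, vit_tokens, local_tokens):
        q = vit_tokens @ self.w_q                        # (N, dim)
        k = local_tokens @ self.w_k                      # (M, dim)
        v = local_tokens @ self.w_v                      # (M, dim)
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (N, M)
        return vit_tokens + self.gamma * (attn @ v)      # residual injection


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for frozen backbone outputs (shapes are illustrative only).
    vit_tokens = rng.normal(size=(196, 64))
    cnn_maps = [rng.normal(size=(32, 28, 28)),
                rng.normal(size=(64, 14, 14)),
                rng.normal(size=(128, 7, 7))]
    extractor = LocalDistortionExtractor([32, 64, 128], 64, rng)
    injector = LocalDistortionInjector(64, rng)
    fused = injector(vit_tokens, extractor(cnn_maps))
    print(fused.shape)
```

In this sketch only the projection matrices and the gate would be trainable; the backbone outputs are treated as fixed inputs, which mirrors the abstract's statement that only the extractor and injector are trained.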