Large multimodal models (LMMs) have exhibited impressive generalization capabilities in understanding and generating visual signals. However, they still lack the ability to perceive low-level visual quality in a way that matches human perception. Can LMMs achieve this and show the same degree of generalization in this regard? If so, not only could the versatility of LMMs be further enhanced, but the poor cross-dataset performance that challenges the field of visual quality assessment could also be addressed. In this paper, we explore this question and provide the answer "Yes!". As a result of this initial exploration, we present VisualCritic, the first LMM for broad-spectrum image subjective quality assessment. VisualCritic can be used across diverse data right out of the box, without the dataset-specific adaptation that conventional specialist models require. As an instruction-following LMM, VisualCritic enables new capabilities: (1) quantitatively measuring the perceptual quality of given images in terms of their Mean Opinion Score (MOS), noisiness, colorfulness, sharpness, and other numerical indicators, (2) qualitatively evaluating visual quality and providing explainable descriptions, and (3) discerning whether a given image is AI-generated or photographic. Extensive experiments demonstrate the efficacy of VisualCritic by comparing it with other open-source LMMs and conventional specialist models on both AI-generated and photographic images.