We introduce Depicted image Quality Assessment (DepictQA), a method that overcomes the constraints of traditional score-based approaches. DepictQA leverages Multi-modal Large Language Models (MLLMs) to provide detailed, language-based, human-like evaluation of image quality. Unlike conventional Image Quality Assessment (IQA) methods, which rely on scores, DepictQA interprets image content and distortions descriptively and comparatively, aligning closely with the human reasoning process. To build the DepictQA model, we establish a hierarchical task framework and collect a multi-modal IQA training dataset named M-BAPPS. To address the challenges of limited training data and of processing multiple images, we propose using multi-source training data and specialized image tags. DepictQA outperforms score-based methods on the BAPPS benchmark. Moreover, compared with general MLLMs, DepictQA generates more accurate descriptive reasoning. Our research indicates that language-based IQA methods have the potential to be customized for individual preferences. Datasets and code will be released publicly.