A Benchmark for Multi-modal Foundation Models on Low-level Vision: from Single Images to Pairs

The rapid development of Multi-modality Large Language Models (MLLMs) has driven a paradigm shift in computer vision toward versatile foundation models. However, the ability of MLLMs in low-level visual perception and understanding remains largely unexplored. To this end, we design benchmark settings that emulate human language responses related to low-level vision: low-level visual perception (A1), evaluated via visual question answering on low-level attributes (e.g. clarity, lighting); and low-level visual description (A2), which evaluates the low-level text descriptions that MLLMs produce. Furthermore, given that pairwise comparison reduces ambiguity in responses and has been adopted in many human experiments, we extend both the perception question answering and the description evaluations from single images to image pairs. Specifically, for perception (A1), we construct the LLVisionQA+ dataset, comprising 2,990 single images and 1,999 image pairs, each accompanied by an open-ended question about its low-level features; for description (A2), we propose the LLDescribe+ dataset, which evaluates MLLMs on low-level descriptions of 499 single images and 450 pairs. Additionally, we evaluate the assessment (A3) ability of MLLMs, i.e. predicting quality scores, by employing a softmax-based approach that enables all MLLMs to generate quantifiable quality ratings, which are tested against human opinions on 7 image quality assessment (IQA) datasets. With 24 MLLMs under evaluation, we demonstrate that several MLLMs have decent low-level visual competencies on single images, but only GPT-4V exhibits higher accuracy on pairwise comparisons than on single-image evaluations (as humans do). We hope that our benchmark will motivate further research into uncovering and enhancing these nascent capabilities of MLLMs. Datasets will be available at https://github.com/Q-Future/Q-Bench.
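To make the softmax-based rating for A3 more concrete, here is a minimal Python sketch. It rests on assumptions that the abstract does not state: that the model's next-token logits for one positive and one negative quality word (taken here to be "good" and "poor") are available, and that agreement with human opinions is measured by a Spearman rank correlation. The function names and the toy numbers are hypothetical and purely illustrative.

```python
import math


def softmax_quality_score(logit_good: float, logit_poor: float) -> float:
    """Two-way softmax over the logits of an assumed positive/negative token pair.

    Returns the probability mass on the positive token, a value in (0, 1)
    that can be read as a continuous quality rating.
    """
    m = max(logit_good, logit_poor)  # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(vx * vy)


# Toy (made-up) logits for three images as (logit_good, logit_poor) pairs,
# and made-up human opinion scores for the same three images.
toy_logits = [(2.0, 0.5), (0.1, 1.3), (1.0, 1.0)]
toy_human_scores = [4.2, 1.8, 3.0]

predicted = [softmax_quality_score(g, p) for g, p in toy_logits]
agreement = spearman(predicted, toy_human_scores)
```

Reading a rating from token probabilities rather than from generated text means that any model exposing its logits can produce a continuous score, even if it never outputs a number when asked directly.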
