The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, the abilities of MLLMs in low-level visual perception and understanding remain inadequately assessed. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate the potential abilities of MLLMs in three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 images from diverse sources, each paired with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs in answering these questions. b) To examine the ability of MLLMs to describe low-level information, we propose the LLDescribe dataset, consisting of long, expert-labelled golden low-level text descriptions of 499 images, together with a GPT-involved pipeline that compares MLLM outputs against the golden descriptions. c) Beyond these two tasks, we further measure how well their visual quality assessments align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess fundamental low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for specific enhancements of MLLMs in these abilities. We hope that our benchmark will encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs.
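The abstract does not spell out the softmax-based strategy, so the following is only a minimal sketch under stated assumptions: the quality score is taken to be the softmax probability of a positive rating token against a negative one, computed from the model's next-token logits, and alignment with human opinion scores is measured with Spearman rank correlation, a common choice in IQA. The function names, the two-token setup, and all numbers are hypothetical and are not taken from the benchmark.

```python
import math


def softmax_quality_score(logit_positive: float, logit_negative: float) -> float:
    """Two-way softmax over a positive and a negative rating token (assumed setup)."""
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(logit_positive, logit_negative)
    e_pos = math.exp(logit_positive - m)
    e_neg = math.exp(logit_negative - m)
    return e_pos / (e_pos + e_neg)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data, invented for illustration only: (positive, negative) logits per image
# and made-up human opinion scores for the same four images.
toy_logits = [(2.0, -1.0), (0.5, 0.5), (-1.5, 1.0), (1.0, 0.0)]
toy_human_scores = [4.5, 3.0, 1.2, 3.8]

predicted = [softmax_quality_score(p, n) for p, n in toy_logits]
alignment = spearman(predicted, toy_human_scores)
print(predicted, alignment)
```

Using a probability rather than the generated text gives a continuous score in [0, 1] per image, which is what makes rank-based comparison against human opinion scores possible in this sketch.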