Q-Bench: A Benchmark for General-Purpose Foundation Models on Low-level Vision

The rapid evolution of Multi-modality Large Language Models (MLLMs) has catalyzed a shift in computer vision from specialized models to general-purpose foundation models. Nevertheless, the abilities of MLLMs in low-level visual perception and understanding remain inadequately assessed. To address this gap, we present Q-Bench, a holistic benchmark crafted to systematically evaluate the potential abilities of MLLMs in three realms: low-level visual perception, low-level visual description, and overall visual quality assessment. a) To evaluate the low-level perception ability, we construct the LLVisionQA dataset, consisting of 2,990 images from diverse sources, each equipped with a human-asked question focusing on its low-level attributes. We then measure the correctness of MLLMs in answering these questions. b) To examine the ability of MLLMs to describe low-level information, we propose the LLDescribe dataset, consisting of long, expert-labelled golden low-level text descriptions of 499 images, together with a GPT-involved pipeline that compares the outputs of MLLMs against the golden descriptions. c) Besides these two tasks, we further measure how well their visual quality assessments align with human opinion scores. Specifically, we design a softmax-based strategy that enables MLLMs to predict quantifiable quality scores, and evaluate them on various existing image quality assessment (IQA) datasets. Our evaluation across the three abilities confirms that MLLMs possess preliminary low-level visual skills. However, these skills are still unstable and relatively imprecise, indicating the need for targeted enhancement of these abilities in MLLMs. We hope that our benchmark can encourage the research community to delve deeper to discover and enhance these untapped potentials of MLLMs. Project Page: https://vqassessment.github.io/Q-Bench.
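
The softmax-based scoring in (c) can be illustrated with a minimal sketch. The abstract does not say which tokens or which alignment metric are used, so everything below is an assumption made for illustration: the score is taken as a two-way softmax over the model's logits for one positive and one negative quality word (hypothetically "good" and "poor"), and alignment with human opinion scores is measured with Spearman rank correlation, a common choice in IQA. The function names are hypothetical and are not taken from the Q-Bench code.

```python
import math


def quality_score(logit_good: float, logit_poor: float) -> float:
    """Two-way softmax over the logits of a positive and a negative token.

    ASSUMPTION: the tokens ("good" / "poor") and the two-way form are
    illustrative; the abstract only states that the strategy is softmax-based.
    """
    m = max(logit_good, logit_poor)  # subtract the max for numerical stability
    e_good = math.exp(logit_good - m)
    e_poor = math.exp(logit_poor - m)
    return e_good / (e_good + e_poor)


def ranks(values):
    """Return 1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson linear correlation; undefined if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


if __name__ == "__main__":
    # Made-up logits for three images and made-up human opinion scores.
    logits = [(2.1, 0.3), (-0.5, 1.2), (0.8, 0.6)]
    predicted = [quality_score(g, p) for g, p in logits]
    human_scores = [78.0, 31.0, 55.0]
    print(predicted, spearman(predicted, human_scores))
```

The score lies in (0, 1) and depends only on the difference between the two logits, so it gives a continuous quality estimate without requiring the model to generate a number as text.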
