Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models

Yuansen Liu, Haiming Tang, Jinlong Peng, Jiangning Zhang, Xiaozhong Ji, Qingdong He, Wenbin Wu, Donghao Luo, Zhenye Gan, Junwei Zhu, Yunhang Shen, Chaoyou Fu, Chengjie Wang, Xiaobin Hu, Shuicheng Yan

Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes has rarely been explored, primarily because no comprehensive evaluation benchmark accounts for both human-oriented fine-grained perception and higher-dimensional causal reasoning. Building such a high-quality benchmark is difficult, given the physical complexity of the human body and the difficulty of annotating fine-grained structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs in human-centric scene understanding. Compared with existing benchmarks, our work offers three key features: 1. Diversity of human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, which assess human-based activities progressively from human-oriented granular perception to higher-dimensional reasoning, comprising eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, produced with an automated annotation pipeline and a human-annotation platform that supports rigorous manual labeling to facilitate precise and reliable model assessment. Our benchmark extends single-target understanding to multi-person and multi-image mutual understanding by constructing choice, short-answer, grounding, ranking, and judgment question components, as well as complex questions that combine them. Extensive experiments on 17 state-of-the-art MLLMs expose their limitations and guide future MLLM research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.
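
To make the idea of question components and their combination concrete, here is a minimal Python sketch of how a combined question might be scored. Everything in it is an assumption for illustration only: the abstract does not specify the benchmark's metrics, so the function names (`box_iou`, `score_component`, `score_question`), the `(x1, y1, x2, y2)` box format, exact-match scoring for text answers, and plain averaging across components are all hypothetical and may differ from the evaluation suite in the Human-MME repository.

```python
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) in pixels. Not taken from the paper.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score_component(kind: str, pred, gold) -> float:
    """Score one component in [0, 1]. The scoring rules are assumptions."""
    if kind in ("choice", "judgment", "short_answer"):
        # Case-insensitive exact match after trimming whitespace.
        return float(str(pred).strip().lower() == str(gold).strip().lower())
    if kind == "ranking":
        # Full credit only if the predicted order matches exactly.
        return float(list(pred) == list(gold))
    if kind == "grounding":
        return box_iou(pred, gold)
    raise ValueError(f"unknown component kind: {kind}")


def score_question(components: Sequence[tuple]) -> float:
    """Average the component scores of a (possibly combined) question."""
    if not components:
        raise ValueError("a question needs at least one component")
    scores = [score_component(kind, pred, gold) for kind, pred, gold in components]
    return sum(scores) / len(scores)
```

Under these assumptions, a combined question that pairs a choice component with a grounding component would be passed as `score_question([("choice", "B", "b"), ("grounding", pred_box, gold_box)])`, and a wrong answer in either component lowers the overall score.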

