Multimodal Large Language Models (MLLMs) have demonstrated significant advances in visual understanding tasks. However, their capacity to comprehend human-centric scenes remains largely unexplored, primarily due to the absence of comprehensive evaluation benchmarks that account for both fine-grained, human-oriented perception and higher-dimensional causal reasoning. Building such high-quality benchmarks is challenging, given the physical complexity of the human body and the difficulty of annotating its granular structures. In this paper, we propose Human-MME, a curated benchmark designed to provide a more holistic evaluation of MLLMs on human-centric scene understanding. Compared with existing benchmarks, our work offers three key features: 1. Diversity in human scenes, spanning 4 primary visual domains with 15 secondary domains and 43 sub-fields to ensure broad scenario coverage. 2. Progressive and diverse evaluation dimensions, assessing human-centered activities step by step from fine-grained, human-oriented perception to higher-dimensional reasoning, comprising eight dimensions with 19,945 real-world image-question pairs and an evaluation suite. 3. High-quality annotations with rich data paradigms, built on an automated annotation pipeline and a human-annotation platform that support rigorous manual labeling for precise and reliable model assessment. Our benchmark further extends single-target understanding to multi-person and multi-image mutual understanding by constructing choice, short-answer, grounding, ranking, and judgment question components, as well as complex questions that combine them. Extensive experiments on 17 state-of-the-art MLLMs effectively expose their limitations and guide future MLLM research toward better human-centric image understanding. All data and code are available at https://github.com/Yuan-Hou/Human-MME.