Multimodal Large Language Models (MLLMs) have achieved impressive results on a variety of vision tasks, leveraging recent advances in large language models. However, a critical question remains unaddressed: do MLLMs perceive visual information similarly to humans? Current benchmarks cannot evaluate MLLMs from this perspective. To address this challenge, we introduce HVSBench, a large-scale benchmark designed to assess the alignment between MLLMs and the human visual system (HVS) on fundamental vision tasks that mirror human vision. HVSBench comprises over 85K multimodal samples spanning 13 categories across 5 fields of the HVS: Prominence, Subitizing, Prioritizing, Free-Viewing, and Searching. Extensive experiments demonstrate that our benchmark provides a comprehensive evaluation of MLLMs. Specifically, we evaluate 13 MLLMs and find that even the best models leave significant room for improvement, with most achieving only moderate results. Our experiments show that HVSBench poses a new and significant challenge for cutting-edge MLLMs. We believe HVSBench will facilitate research on human-aligned and explainable MLLMs, marking a key step toward understanding how MLLMs perceive and process visual information.