Multimodal Large Language Models (MLLMs) have demonstrated significant potential to advance a broad range of domains. However, current benchmarks for evaluating MLLMs primarily emphasize general knowledge and vertical step-by-step reasoning typical of STEM disciplines, while overlooking the distinct needs and potential of the Humanities and Social Sciences (HSS). Tasks in the HSS domain require more horizontal, interdisciplinary thinking and a deep integration of knowledge across related fields, which presents unique challenges for MLLMs, particularly in linking abstract concepts with corresponding visual representations. Addressing this gap, we present HSSBench, a dedicated benchmark designed to assess the capabilities of MLLMs on HSS tasks in multiple languages, including the six official languages of the United Nations. We also introduce a novel data generation pipeline tailored for HSS scenarios, in which multiple domain experts and automated agents collaborate to generate and iteratively refine each sample. HSSBench contains over 13,000 meticulously designed samples, covering six key categories. We benchmark more than 20 mainstream MLLMs on HSSBench and demonstrate that it poses significant challenges even for state-of-the-art models. We hope that this benchmark will inspire further research into enhancing the cross-disciplinary reasoning abilities of MLLMs, especially their capacity to internalize and connect knowledge across fields.