With the rapid advancement of multimodal large language models (MLLMs), their evaluation has become increasingly comprehensive. However, understanding long multimodal content, a foundational ability for real-world applications, remains underexplored. In this work, we present Needle In A Multimodal Haystack (MM-NIAH), the first benchmark specifically designed to systematically evaluate the capability of existing MLLMs to comprehend long multimodal documents. Our benchmark comprises three types of evaluation tasks: multimodal retrieval, counting, and reasoning. In each task, the model must answer questions based on key information scattered throughout the given multimodal document. Evaluating leading MLLMs on MM-NIAH, we observe that existing models still have significant room for improvement on these tasks, especially in vision-centric evaluation. We hope this work provides a platform for further research on long multimodal document comprehension and contributes to the advancement of MLLMs. Code and benchmark are released at https://github.com/OpenGVLab/MM-NIAH.