Rapid progress in multimodal large language models (MLLMs) highlights the need for challenging yet realistic benchmarks in the academic community. Existing benchmarks focus primarily on understanding simple natural images. In contrast, Multi is a benchmark for MLLMs that offers a comprehensive dataset for evaluating how well they understand complex figures, tables, and scientific questions. Reflecting current, realistic examination styles, Multi provides multimodal inputs and requires responses that are either precise or open-ended, as in real-life school tests. It challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. Multi includes over 18,000 questions, with a focus on science-based QA in diverse formats. We also introduce Multi-Elite, a 500-question subset for testing the limits of MLLMs, and Multi-Extend, which supports In-Context Learning research with more than 4,500 knowledge pieces. Our evaluation indicates significant room for MLLM advancement: GPT-4V achieves 63.7% accuracy on Multi, while other MLLMs score between 31.3% and 53.7%. Multi not only serves as a robust evaluation platform but also paves the way for the development of expert-level AI.