Rapid progress in multimodal large language models (MLLMs) highlights the need to introduce challenging yet realistic benchmarks to the academic community, yet existing benchmarks primarily focus on understanding simple natural images and short contexts. In this paper, we present MULTI, a cutting-edge benchmark for evaluating MLLMs on understanding complex tables and images and on reasoning with long contexts. MULTI provides multimodal inputs and requires responses that are either precise or open-ended, reflecting real-life examination styles. MULTI includes over 18,000 questions and challenges MLLMs with a variety of tasks, ranging from formula derivation to image detail analysis and cross-modality reasoning. We also introduce MULTI-Elite, a selected hard subset of 500 questions, and MULTI-Extend, with more than 4,500 pieces of external knowledge context. Our evaluation indicates significant potential for MLLM advancement, with GPT-4V achieving 63.7% accuracy on MULTI, while other MLLMs score between 28.5% and 55.3%. MULTI not only serves as a robust evaluation platform but also paves the way for the development of expert-level AI.