Multimodal Large Language Models (MLLMs) rely on powerful LLMs to perform multimodal tasks and have shown striking emergent abilities in recent studies, such as writing poems based on an image. However, such case studies cannot fully reflect MLLM performance, and a comprehensive evaluation is still lacking. In this paper, we fill this gap by presenting MME, the first MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, the annotations of all instruction-answer pairs are manually designed. The concise instruction design allows us to compare MLLMs fairly, without struggling with prompt engineering. Such instructions also make it easy to compute quantitative statistics. A total of 10 advanced MLLMs are comprehensively evaluated on MME; the results not only suggest that existing MLLMs still have considerable room for improvement, but also reveal potential directions for subsequent model optimization.