Multimodal Large Language Models (MLLMs) rely on a powerful LLM to perform multimodal tasks and have shown striking emergent abilities in recent studies, such as writing a poem based on an image. However, such case studies cannot fully reflect the performance of MLLMs, and a comprehensive evaluation is still lacking. In this paper, we fill this gap by presenting MME, the first MLLM evaluation benchmark. It measures both perception and cognition abilities across a total of 14 subtasks. To avoid the data leakage that may arise from directly using public datasets for evaluation, the annotations of all instruction-answer pairs are manually designed. The concise instruction design lets us compare MLLMs fairly, rather than struggling with prompt engineering. Such instructions also make it easy to compute quantitative statistics. We comprehensively evaluate 12 advanced MLLMs on MME; the results not only suggest that existing MLLMs still have considerable room for improvement, but also reveal potential directions for subsequent model optimization.