A Survey on Evaluation of Multimodal Large Language Models

Multimodal Large Language Models (MLLMs) mimic the human perception and reasoning system by integrating powerful Large Language Models (LLMs) with various modality encoders (e.g., vision, audio), positioning the LLM as the "brain" and the modality encoders as sensory organs. This framework endows MLLMs with human-like capabilities and suggests a potential pathway towards artificial general intelligence (AGI). With the emergence of all-round MLLMs such as GPT-4V and Gemini, a multitude of evaluation methods have been developed to assess their capabilities across different dimensions. This paper presents a systematic and comprehensive review of MLLM evaluation methods, covering the following key aspects: (1) the background of MLLMs and their evaluation; (2) "what to evaluate", which reviews and categorizes existing MLLM evaluation tasks by the capabilities assessed, including general multimodal recognition, perception, reasoning, and trustworthiness, as well as domain-specific applications such as socioeconomics, natural sciences and engineering, medical usage, AI agents, remote sensing, video and audio processing, 3D point cloud analysis, and others; (3) "where to evaluate", which summarizes MLLM evaluation benchmarks into general and specific benchmarks; and (4) "how to evaluate", which reviews and illustrates MLLM evaluation steps and metrics. Our overarching goal is to provide valuable insights for researchers in the field of MLLM evaluation, thereby facilitating the development of more capable and reliable MLLMs. We emphasize that evaluation should be regarded as a critical discipline, essential for advancing the field of MLLMs.
