The rapid evolution of Multimodal Large Language Models (MLLMs) has brought substantial advancements in artificial intelligence, significantly enhancing the capability to understand and generate multimodal content. While prior studies have largely concentrated on model architectures and training methodologies, a thorough analysis of the benchmarks used to evaluate these models remains lacking. This survey addresses that gap by systematically reviewing 211 benchmarks that assess MLLMs across four core domains: understanding, reasoning, generation, and application. We provide a detailed analysis of task designs, evaluation metrics, and dataset constructions across diverse modalities. We hope this survey will contribute to the ongoing advancement of MLLM research by offering a comprehensive overview of benchmarking practices and identifying promising directions for future work. An associated GitHub repository collecting the latest papers is available.