Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant effort has gone into examining MLLMs from multiple perspectives. This paper presents a comprehensive review of \textbf{180 benchmarks} and evaluations of MLLMs, focusing on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Finally, we discuss the limitations of current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline in its own right, one that better supports the development of MLLMs. For more details, please visit our GitHub repository: https://github.com/swordlidev/Evaluation-Multimodal-LLMs-Survey.