Multimodal Large Language Models (MLLMs) are gaining increasing popularity in both academia and industry due to their remarkable performance in applications such as visual question answering, visual perception, understanding, and reasoning. Over the past few years, significant efforts have been made to examine MLLMs from multiple perspectives. This paper presents a comprehensive review of 200 benchmarks and evaluations for MLLMs, focusing on (1) perception and understanding, (2) cognition and reasoning, (3) specific domains, (4) key capabilities, and (5) other modalities. Finally, we discuss the limitations of current evaluation methods for MLLMs and explore promising future directions. Our key argument is that evaluation should be regarded as a crucial discipline to better support the development of MLLMs. For more details, please visit our GitHub repository: https://github.com/swordlidev/Evaluation-Multimodal-LLMs-Survey.