In the pursuit of Artificial General Intelligence (AGI), the integration of vision into language models has marked a significant milestone. The advent of multi-modal large language models (MLLMs) such as GPT-4V has expanded AI applications, in line with the multi-modal capabilities of the human brain. However, evaluating the efficacy of MLLMs poses a substantial challenge because many tasks are subjective and lack definitive answers. Existing automatic evaluation methodologies for MLLMs rely on objective queries that have standard answers, and so inadequately address the nuances of creative and associative multi-modal tasks. To address this, we introduce MLLM-Bench, an innovative benchmark inspired by Vicuna that spans a diverse array of scenarios, including Perception, Understanding, Applying, Analyzing, Evaluating, and Creation, along with ethical considerations. MLLM-Bench is designed to reflect user experience more accurately and provide a more holistic assessment of model performance. Comparative evaluations indicate a significant performance gap between existing open-source models and GPT-4V. We posit that MLLM-Bench will catalyze progress in the open-source community towards developing user-centric vision-language models that meet a broad spectrum of real-world applications. The online leaderboard is available at \url{https://mllm-bench.llmzoo.com}.