Multimodal Large Language Models (MLLMs) have gained significant attention recently, showing remarkable potential in artificial general intelligence. However, assessing the utility of MLLMs presents considerable challenges, primarily due to the absence of multimodal benchmarks that align with human preferences. Inspired by LLM-as-a-Judge for LLMs, this paper introduces a novel benchmark, termed MLLM-as-a-Judge, to assess the ability of MLLMs to assist judges across three distinct tasks: Scoring Evaluation, Pair Comparison, and Batch Ranking. Our study reveals that, while MLLMs demonstrate remarkably human-like discernment in Pair Comparison, their judgments diverge significantly from human preferences in Scoring Evaluation and Batch Ranking. Furthermore, MLLMs still face challenges in judgment, including diverse biases, hallucinatory responses, and inconsistencies, even for advanced models such as GPT-4V. These findings emphasize the pressing need for enhancements and further research before MLLMs can be regarded as fully reliable evaluators. Code and dataset are available at https://github.com/Dongping-Chen/MLLM-as-a-Judge.
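To make the three tasks concrete, the following is a minimal sketch of how agreement between an MLLM judge and human annotators could be quantified for each one. The abstract does not specify the metrics, so the choices here (Pearson correlation for Scoring Evaluation, accuracy for Pair Comparison, normalized edit distance for Batch Ranking) are assumptions, and every function name is hypothetical rather than taken from the released code.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Scoring Evaluation (assumed metric): Pearson correlation
    between model scores and human scores for the same items."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(vx * vy)


def pair_accuracy(model: Sequence[str], human: Sequence[str]) -> float:
    """Pair Comparison (assumed metric): fraction of pairs where the
    model's verdict (e.g. "A", "B", "tie") matches the human verdict."""
    if len(model) != len(human) or not model:
        raise ValueError("need two equal-length, non-empty verdict lists")
    return sum(m == h for m, h in zip(model, human)) / len(model)


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution
        prev = curr
    return prev[-1]


def ranking_distance(model: Sequence[str], human: Sequence[str]) -> float:
    """Batch Ranking (assumed metric): edit distance between the two
    rankings, normalized to [0, 1]; 0.0 means identical orderings."""
    longest = max(len(model), len(human))
    return edit_distance(model, human) / longest if longest else 0.0
```

Under these assumptions, a higher correlation and accuracy indicate closer alignment with human preferences, while a lower ranking distance does. For example, `ranking_distance("ABCD", "ABDC")` treats each response as a one-letter label and penalizes the swapped last two positions.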