Large vision-language models have recently achieved remarkable progress, exhibiting strong perception and reasoning abilities over visual information. However, effectively evaluating these models remains a major obstacle that hinders future model development. Traditional benchmarks such as VQAv2 or COCO Caption provide quantitative performance measurements but lack fine-grained ability assessment and rely on non-robust evaluation metrics. Recent subjective benchmarks, such as OwlEval, offer comprehensive evaluations of a model's abilities by incorporating human labor, but they are not scalable and display significant bias. In response to these challenges, we propose MMBench, a novel multi-modality benchmark. MMBench methodically develops a comprehensive evaluation pipeline composed primarily of two elements. The first element is a meticulously curated dataset that surpasses existing similar benchmarks in the number and variety of evaluation questions and abilities. The second element introduces a novel CircularEval strategy and uses ChatGPT to convert free-form predictions into pre-defined choices, thereby enabling a more robust evaluation of the model's predictions. MMBench is a systematically designed objective benchmark for robustly evaluating the various abilities of vision-language models. We hope MMBench will help the research community better evaluate their models and encourage future advances in this domain. Project page: https://opencompass.org.cn/mmbench.
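The abstract names CircularEval without spelling out its mechanics. The sketch below illustrates one reading of the strategy, as we understand it from the MMBench paper: a multiple-choice question with N options is posed N times, each time with the options circularly shifted, and the model is credited only if it answers correctly in every pass. The names `circular_shifts`, `circular_eval`, and the `predict` callable are hypothetical and are not part of any released MMBench code; `predict` is assumed to return an option label such as "A" directly, so the ChatGPT-based step that maps free-form output to a label is outside this example.

```python
from typing import Callable, Iterator, List, Tuple

LABELS = "ABCDEFGH"  # option labels; assumes at most 8 choices per question


def circular_shifts(choices: List[str], answer_idx: int) -> Iterator[Tuple[List[str], int]]:
    """Yield every circular shift of `choices` with the shifted answer index."""
    n = len(choices)
    for k in range(n):
        shifted = choices[k:] + choices[:k]
        # Rotating left by k moves the correct option from answer_idx to answer_idx - k.
        yield shifted, (answer_idx - k) % n


def circular_eval(
    predict: Callable[[str, List[str]], str],  # hypothetical model wrapper returning a label
    question: str,
    choices: List[str],
    answer_idx: int,
) -> bool:
    """Return True only if `predict` picks the correct label in every circular pass."""
    for shifted, idx in circular_shifts(list(choices), answer_idx):
        if predict(question, shifted) != LABELS[idx]:
            return False
    return True
```

Under this reading, a model that always answers "A" passes a single-shot evaluation whenever "A" happens to be correct, but fails here as soon as the correct option rotates to another position, which is the kind of position bias the strategy is meant to expose.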