LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models

Large Vision-Language Models (LVLMs) have recently played a dominant role in multimodal vision-language learning. Despite their great success, a holistic evaluation of their efficacy is still lacking. This paper presents a comprehensive evaluation of publicly available large multimodal models by building an LVLM evaluation Hub (LVLM-eHub). LVLM-eHub consists of $8$ representative LVLMs, such as InstructBLIP and MiniGPT-4, which are thoroughly evaluated through a quantitative capability evaluation and an online arena platform. The former evaluates $6$ categories of multimodal capabilities, such as visual question answering and embodied artificial intelligence, on $47$ standard text-related visual benchmarks, while the latter provides a user-level evaluation of LVLMs in an open-world question-answering scenario. The study reveals several notable findings. First, LVLMs instruction-tuned on massive in-domain data, such as InstructBLIP, heavily overfit many existing tasks and generalize poorly in the open-world scenario. Second, LVLMs instruction-tuned on a moderate amount of instruction-following data may suffer from object hallucination (i.e., their descriptions mention objects that are inconsistent with the target images). This either renders current evaluation metrics, such as CIDEr for image captioning, ineffective or leads to wrong answers. Third, employing a multi-turn reasoning evaluation framework can mitigate object hallucination, shedding light on how to develop an effective pipeline for LVLM evaluation. These findings provide a foundational framework for the conception and assessment of innovative strategies aimed at enhancing zero-shot multimodal techniques. Our LVLM-eHub will be available at https://github.com/OpenGVLab/Multi-Modality-Arena
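
To make the notion of object hallucination concrete, the following is a minimal sketch and not the evaluation code used in this work. It assumes that the objects mentioned in a generated description have already been extracted (the extraction step is not shown) and that a set of annotated objects exists for the image. The function name `hallucination_rate` and the example object lists are hypothetical.

```python
def hallucination_rate(mentioned_objects, ground_truth_objects):
    """Return the fraction of mentioned objects that are absent from the image annotations.

    Both arguments are iterables of object names; matching is case-insensitive
    exact matching, which is a simplifying assumption.
    """
    mentioned = {name.lower() for name in mentioned_objects}
    truth = {name.lower() for name in ground_truth_objects}
    if not mentioned:
        # A description that mentions no objects cannot hallucinate any.
        return 0.0
    hallucinated = mentioned - truth
    return len(hallucinated) / len(mentioned)


# Hypothetical example: the description mentions a bench that is not annotated.
rate = hallucination_rate(
    ["dog", "frisbee", "bench"],   # objects mentioned in the generated description
    ["dog", "frisbee", "grass"],   # objects annotated in the target image
)
```

Here one of the three mentioned objects is missing from the annotations, so `rate` is one third. A score of this kind looks only at which objects are mentioned, whereas the finding above concerns the way such descriptions can make an n-gram metric like CIDEr ineffective.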
