A comprehensive evaluation is critical to assessing the capabilities of large multimodal models (LMMs). In this study, we evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on the VQAonline dataset. VQAonline is an end-to-end authentic VQA dataset sourced from a diverse range of everyday users. Compared with previous benchmarks, VQAonline aligns more closely with real-world tasks. It enables us to effectively evaluate the generality of an LMM and facilitates a direct comparison with human performance. To comprehensively evaluate GPT-4V and Gemini, we generate seven types of metadata for around 2,000 visual questions, such as image type and the required image processing capabilities. Leveraging this metadata, we analyze the zero-shot performance of GPT-4V and Gemini and identify the most challenging questions for both models.