While there is much excitement about the potential of large multimodal models (LMMs), comprehensive evaluation is critical to establishing their true capabilities and limitations. To this end, we evaluate two state-of-the-art LMMs, GPT-4V and Gemini, on a new visual question answering dataset sourced from an authentic online question answering community. We conduct a fine-grained analysis by generating seven types of metadata for nearly 2,000 visual questions, such as the image type and the required image processing capabilities. Our zero-shot performance analysis highlights the types of questions that are most challenging for both models, including questions on the "puzzling" topic, questions with the "Identification" user intention, questions with the "Sheet Music" image type, and questions labeled as "hard" by GPT-4.