The emergence of multimodal large models (MLMs) has significantly advanced the field of visual understanding, offering remarkable capabilities in visual question answering (VQA). Yet the true challenge lies in knowledge-intensive VQA tasks, which require not just recognition of visual elements but also a deep comprehension of the visual information in conjunction with a vast repository of learned knowledge. To uncover such capabilities of MLMs, particularly the newly introduced GPT-4V, we provide an in-depth evaluation from three perspectives: 1) Commonsense Knowledge, which assesses how well models can understand visual cues and connect them to general knowledge; 2) Fine-grained World Knowledge, which tests a model's skill in reasoning out specific knowledge from images, showcasing its proficiency across various specialized fields; 3) Comprehensive Knowledge with Decision-making Rationales, which examines a model's capability to provide logical explanations for its inferences, facilitating deeper analysis from the interpretability perspective. Extensive experiments indicate that GPT-4V achieves state-of-the-art (SOTA) performance on all three tasks. Interestingly, we find that: a) GPT-4V demonstrates enhanced reasoning and explanation when composite images are used as few-shot examples; b) GPT-4V produces severe hallucinations when dealing with world knowledge, highlighting the need for future advancements in this research direction.