Understanding the deep semantics of images is essential in an era dominated by social media. However, current research focuses primarily on superficial image descriptions, revealing a notable deficiency in the systematic investigation of inherent deep semantics. In this work, we introduce DEEPEVAL, a comprehensive benchmark for assessing Large Multimodal Models' (LMMs) capacities for understanding visual deep semantics. DEEPEVAL includes a human-annotated dataset and three progressive subtasks: fine-grained description selection, in-depth title matching, and deep semantics understanding. Using DEEPEVAL, we evaluate 9 open-source LMMs and GPT-4V(ision). Our evaluation demonstrates a substantial gap between the deep semantic comprehension capabilities of existing LMMs and those of humans. For example, GPT-4V lags 30% behind humans in understanding deep semantics, even though it achieves human-comparable performance in image description. Further analysis reveals that LMM performance on DEEPEVAL varies with the specific facet of deep semantics explored, indicating the fundamental challenges that remain in developing LMMs.