II-Bench: An Image Implication Understanding Benchmark for Multimodal Large Language Models

Ziqiang Liu, Feiteng Fang, Xi Feng, Xinrun Du, Chenhao Zhang, Zekun Wang, Yuelin Bai, Qixuan Zhao, Liyang Fan, Chengguang Gan, Hongquan Lin, Jiaming Li, Yuansheng Ni, Haihong Wu, Yaswanth Narsupalli, Zhigang Zheng, Chengming Li, Xiping Hu, Ruifeng Xu, Xiaojun Chen, Min Yang, Jiaheng Liu, Ruibo Liu, Wenhao Huang, Ge Zhang, Shiwen Ni

From arXiv. Comments: 100 pages, 82 figures, add citations

Rapid advances in multimodal large language models (MLLMs) have repeatedly produced new breakthroughs on various benchmarks. In response, numerous challenging and comprehensive benchmarks have been proposed to assess the capabilities of MLLMs more accurately. However, the higher-order perceptual capabilities of MLLMs remain largely unexplored. To fill this gap, we propose the Image Implication understanding Benchmark, II-Bench, which aims to evaluate a model's higher-order perception of images. Extensive experiments on II-Bench across multiple MLLMs yield three main findings. First, there is a substantial gap between the performance of MLLMs and humans on II-Bench: the best MLLM accuracy is 74.8%, whereas human accuracy averages 90% and peaks at 98%. Second, MLLMs perform worse on abstract and complex images, suggesting limitations in their ability to understand high-level semantics and capture image details. Finally, most models achieve higher accuracy when image sentiment polarity hints are incorporated into the prompts, which points to a notable deficiency in their inherent understanding of image sentiment. We believe that II-Bench will inspire the community to develop the next generation of MLLMs, advancing the journey towards expert artificial general intelligence (AGI). II-Bench is publicly available at https://huggingface.co/datasets/m-a-p/II-Bench.
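
To make the evaluation setup described above more concrete, the following is a minimal sketch of how a multiple-choice prompt might carry an optional sentiment polarity hint and how accuracy could then be scored. The prompt wording, the option labelling, and the function names `build_prompt` and `accuracy` are illustrative assumptions, not the template or code used by the II-Bench authors.

```python
from typing import Optional, Sequence


def build_prompt(question: str,
                 options: Sequence[str],
                 sentiment_hint: Optional[str] = None) -> str:
    """Format a multiple-choice question as text.

    Hypothetical template: if sentiment_hint is given (e.g. "negative"),
    a one-line hint is placed before the question.
    """
    lines = []
    if sentiment_hint is not None:
        lines.append(f"Hint: the sentiment polarity of this image is {sentiment_hint}.")
    lines.append(f"Question: {question}")
    # Label options (A), (B), ... in order; assumes at most six options.
    for label, option in zip("ABCDEF", options):
        lines.append(f"({label}) {option}")
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


def accuracy(predictions: Sequence[str], answers: Sequence[str]) -> float:
    """Fraction of predicted option letters that match the gold letters."""
    if len(predictions) != len(answers):
        raise ValueError("predictions and answers must have the same length")
    if not answers:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p.strip().upper() == a.strip().upper()
                  for p, a in zip(predictions, answers))
    return correct / len(answers)
```

Under these assumptions, comparing `accuracy` on model outputs obtained with and without `sentiment_hint` would give the kind of with-hint versus without-hint comparison the abstract refers to.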
