Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVLMs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks primarily focus on facts or specific topic-related knowledge contained within individual images. However, they often overlook the associative relations between multiple images, which require identifying and analyzing similarities among entities or content present in different images. Therefore, we propose the multi-image relation association task and a meticulously curated Multi-granularity Multi-image Relational Association (MMRA) benchmark comprising 1,024 samples. To systematically and comprehensively evaluate current LVLMs, we establish an associative relation system among images that contains 11 subtasks (e.g., UsageSimilarity, SubEvent) at two granularity levels (i.e., image and entity), based on the relations in ConceptNet. Our experiments reveal that on the MMRA benchmark, current multi-image LVLMs exhibit distinct strengths and weaknesses across the subtasks. Notably, fine-grained, entity-level multi-image perception tasks pose a greater challenge for LVLMs than image-level tasks. Moreover, LVLMs perform poorly on tasks involving spatial relations, indicating that their spatial awareness remains limited. Our findings also indicate that while LVLMs demonstrate a strong ability to perceive image details, improving their capacity to associate information across multiple images hinges on strengthening the reasoning capabilities of their language model component. Finally, we explore the ability of LVLMs to perceive image sequences within the context of our multi-image association task; the results show that the majority of current LVLMs do not adequately model image sequences during pre-training.
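For illustration only, the sketch below shows one plausible way a multi-image association sample could be represented and scored as a multiple-choice item. The field names, the `MMRASample` class, and the `accuracy` helper are hypothetical and are not taken from the MMRA release; they merely reflect the structure described above (two granularity levels, named subtasks, images to be associated).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical schema for a multi-image association sample.
# Field names are illustrative and do not reflect the actual MMRA data format.
@dataclass
class MMRASample:
    image_paths: List[str]      # images to be associated
    granularity: str            # "image" or "entity" level
    subtask: str                # e.g. "UsageSimilarity", "SubEvent"
    question: str               # question about the relation between the images
    options: Dict[str, str]     # option letter -> option text
    answer: str                 # gold option letter

def accuracy(samples: List[MMRASample],
             predict: Callable[[MMRASample], str]) -> float:
    """Fraction of samples where the model's chosen option letter matches the gold answer.
    `predict` is any callable that maps a sample to an option letter (assumed interface)."""
    if not samples:
        return 0.0
    correct = sum(1 for s in samples if predict(s) == s.answer)
    return correct / len(samples)
```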