Given the remarkable success that large visual language models (LVLMs) have achieved in image perception tasks, the endeavor to make LVLMs perceive the world like humans is drawing increasing attention. Current multi-modal benchmarks mainly focus on objective facts or topic-related potential knowledge within a single image, but overlook the associative relations between multiple images. Therefore, we define a multi-image relation association task and meticulously curate the \textbf{MMRA} benchmark, a \textbf{M}ulti-granularity \textbf{M}ulti-image \textbf{R}elational \textbf{A}ssociation benchmark, consisting of \textbf{1026} samples. In order to systematically and comprehensively evaluate mainstream LVLMs, we establish an associative relation system among images that contains \textbf{11 subtasks} (e.g., UsageSimilarity, SubEvent) at two granularity levels (i.e., "\textbf{image}" and "\textbf{entity}") according to the relations in ConceptNet. Our experiments demonstrate that, on our MMRA benchmark, current mainstream LVLMs all have their own advantages and disadvantages across different subtasks. It is worth noting that, at the entity level, all models perform worse than they do at the image level, indicating that the fine-grained multi-image perception task is still challenging for LVLMs. Tasks related to spatial perception are relatively difficult for LVLMs to handle. Furthermore, we find that LVLMs exhibit a good ability to perceive image details, and the key to enhancing their multi-image association capability is to strengthen the reasoning ability of their language model component. All our code and data are released at \url{https://github.com/Wusiwei0410/MMRA}.