With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs' ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: https://huggingface.co/datasets/mehrankazemi/ReMI.