ReMI: A Dataset for Reasoning with Multiple Images

Mehran Kazemi, Nishanth Dikkala, Ankit Anand, Petar Devic, Ishita Dasgupta, Fangyu Liu, Bahare Fatemi, Pranjal Awasthi, Dee Guo, Sreenivas Gollapudi, Ahmed Qureshi

With the continuous advancement of large language models (LLMs), it is essential to create new benchmarks to effectively evaluate their expanding capabilities and identify areas for improvement. This work focuses on multi-image reasoning, an emerging capability in state-of-the-art LLMs. We introduce ReMI, a dataset designed to assess LLMs' ability to Reason with Multiple Images. This dataset encompasses a diverse range of tasks, spanning various reasoning domains such as math, physics, logic, code, table/chart understanding, and spatial and temporal reasoning. It also covers a broad spectrum of characteristics found in multi-image reasoning scenarios. We have benchmarked several cutting-edge LLMs using ReMI and found a substantial gap between their performance and human-level proficiency. This highlights the challenges in multi-image reasoning and the need for further research. Our analysis also reveals the strengths and weaknesses of different models, shedding light on the types of reasoning that are currently attainable and areas where future models require improvement. To foster further research in this area, we are releasing ReMI publicly: https://huggingface.co/datasets/mehrankazemi/ReMI.

