Extensive research has explored the capability of Large Language Models (LLMs) for table reasoning and has significantly improved performance on existing benchmarks. However, tables and user questions in real-world applications are more complex and diverse, presenting a non-negligible gap from existing benchmarks. To fill this gap, we propose a \textbf{M}ult\textbf{i}-scale spreadsheet benchmark with \textbf{M}eta \textbf{o}perations for \textbf{Table} reasoning, named MiMoTable. Specifically, MiMoTable incorporates two key features. First, the tables in MiMoTable are all spreadsheets used in real-world scenarios, covering seven domains and comprising diverse table types. Second, we define a new criterion with six categories of meta operations for measuring the difficulty of each question in MiMoTable, which also offers a new perspective for measuring the difficulty of existing benchmarks. Experimental results show that Claude-3.5-Sonnet achieves the best performance with 77.4\% accuracy, indicating that there is still significant room for LLMs to improve on MiMoTable. Furthermore, we grade the difficulty of existing benchmarks according to our new criterion and find that LLM performance decreases as benchmark difficulty increases, demonstrating the effectiveness of the proposed criterion.