Despite the real-world significance of tabular data, model performance on it remains underexplored, leaving uncertainty about which model to rely on and which prompt configuration to adopt. To address this gap, we create ToRR, a benchmark for Table Reasoning and Robustness that measures model performance and robustness on table-related tasks. The benchmark includes 10 datasets covering different types of table reasoning capabilities across varied domains. ToRR goes beyond model performance rankings and is designed to reflect whether models can handle tabular data consistently and robustly across a variety of common table representation formats. We present a leaderboard as well as comprehensive analyses of the results of leading models on ToRR. Our results reveal a striking pattern of brittle model behavior: even strong models fail to perform robustly on tabular data tasks. Although no single table format leads to consistently better performance, we show that testing over multiple formats is crucial for reliably estimating model capabilities. Moreover, we show that the reliability boost from testing over multiple prompts can be equivalent to that from adding more test examples. Overall, our findings show that table understanding and reasoning remain a significant challenge.
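To make the format-robustness idea concrete, below is a minimal Python sketch. It is not ToRR's actual harness; the table contents, helper names, and prompt wording are all illustrative. It serializes one small table into three common textual formats so that the same question can be posed once per format and the answers compared for consistency.

```python
# Minimal sketch (assumed, not ToRR's code): render one table in several
# common textual formats, so a model can be queried once per format and
# its answers checked for agreement.
import csv
import io
import json

header = ["city", "population"]
rows = [["Paris", 2_100_000], ["Lyon", 520_000]]

def to_csv(header, rows):
    # Standard comma-separated rendering with a header row.
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

def to_json(header, rows):
    # List-of-records rendering: one JSON object per row.
    return json.dumps([dict(zip(header, r)) for r in rows], indent=2)

def to_markdown(header, rows):
    # Pipe-delimited markdown table with a separator line.
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(str(c) for c in r) + " |" for r in rows]
    return "\n".join(lines)

# One prompt per format; a robust model should give the same answer to all.
for name, render in [("csv", to_csv), ("json", to_json), ("markdown", to_markdown)]:
    prompt = f"Given this table:\n{render(header, rows)}\nWhich city is larger?"
    print(f"--- {name} ---\n{prompt}\n")
```

In an actual evaluation one would score the model's answer under each format and report the spread across formats, which is the kind of robustness signal the abstract describes.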