Large Language Models (LLMs), while increasingly dominant across a myriad of knowledge-intensive tasks, have had only limited success in understanding lengthy mixtures of tables and text, such as academic papers and financial reports. Recent advances in long-context LLMs have opened up new possibilities for this field. Nonetheless, we identify two roadblocks: (1) prior benchmarks for table question answering (TableQA) have focused on isolated tables without surrounding context, making it hard to evaluate models in real-world scenarios; (2) prior benchmarks have targeted narrow subsets of table-comprehension skills, such as table recognition, data manipulation/calculation, and table summarization, whereas a skilled human employs these skills collectively. In this work, we introduce TableQuest, a new benchmark designed to evaluate the holistic table-comprehension capabilities of LLMs in the naturally table-rich context of financial reports. We employ a rigorous data processing and filtering procedure to ensure that the question-answer pairs are logical, reasonable, and diverse. We experiment with 7 state-of-the-art models and find that, despite reasonable accuracy in locating facts, they often falter when required to perform more sophisticated reasoning or multi-step calculations. We conclude with a qualitative study of the failure modes and discuss the challenges of constructing a truly challenging benchmark. We make the evaluation data, judging procedure, and results of this study publicly available to facilitate research in this field.