Large Language Models (LLMs), while increasingly dominant across a myriad of knowledge-intensive tasks, have had only limited success in understanding lengthy mixtures of tables and text, such as academic papers and financial reports. Recent advances in long-context LLMs have opened up new possibilities for this field. Nonetheless, we identify two roadblocks: (1) prior benchmarks for table question answering (TableQA) have focused on isolated tables without surrounding context, making it hard to evaluate models in real-world scenarios; (2) prior benchmarks have targeted narrow subsets of table-comprehension skills, such as table recognition, data manipulation/calculation, and table summarization, whereas a skilled human employs these skills collectively. In this work, we introduce TableQuest, a new benchmark designed to evaluate the holistic table-comprehension capabilities of LLMs in the naturally table-rich context of financial reports. We employ a rigorous data processing and filtering procedure to ensure that the question-answer pairs are logical, reasonable, and diverse. We experiment with 7 state-of-the-art models and find that, despite reasonable accuracy in locating facts, they often falter when required to perform more sophisticated reasoning or multi-step calculations. We conclude with a qualitative study of the failure modes and discuss the challenges of constructing a truly challenging benchmark. We make the evaluation data, judging procedure, and results of this study publicly available to facilitate research in this field.