Advances in large language models (LLMs) have opened new opportunities in complex multi-modal data management tasks, particularly question answering (QA) over multi-table relational data. Despite significant progress, systematically evaluating LLMs on multi-table QA remains a critical challenge, owing to the inherent complexity of analyzing relational data structures and the potentially large scale of serialized tabular data. Existing benchmarks focus primarily on single-table QA and therefore fail to capture the connections across multiple relational tables that arise in real-world domains such as finance, healthcare, and e-commerce. We present TQA-Bench, a long-context analytical multi-table QA benchmark derived from real-world public datasets. It provides a flexible sampling mechanism that varies context length (8K--64K tokens), along with symbolic extensions for assessing reasoning beyond retrieval and pattern matching. We systematically evaluate a set of LLMs ranging in scale from 2 billion to 671 billion parameters. Our extensive experiments reveal critical insights into LLM performance on multi-table QA, highlighting both challenges and opportunities for applying LLMs in complex, data-driven environments.