The advent of large language models (LLMs) has unlocked significant opportunities in complex data management tasks, particularly in question answering (QA) over multi-table relational data. Despite substantial progress, systematically evaluating LLMs on multi-table QA remains a critical challenge due to the inherent complexity of analyzing heterogeneous table structures and the potentially large scale of serialized relational data. Existing benchmarks primarily focus on single-table QA, failing to capture the intricacies of reasoning across multiple relational tables, as required in real-world domains such as finance, healthcare, and e-commerce. To address this gap, we present TQA-Bench, a new multi-table QA benchmark designed to evaluate the capabilities of LLMs in tackling complex QA tasks over relational data. Our benchmark incorporates diverse relational database instances sourced from real-world public datasets and introduces a flexible sampling mechanism to create tasks with varying multi-table context lengths, ranging from 8K to 64K tokens. To ensure robustness and reliability, we integrate symbolic extensions into the evaluation framework, enabling the assessment of LLM reasoning capabilities beyond simple data retrieval or probabilistic pattern matching. We systematically evaluate a range of open-source and closed-source LLMs, spanning model scales from 7 billion to 70 billion parameters. Our extensive experiments reveal critical insights into the performance of LLMs in multi-table QA, highlighting both challenges and opportunities for advancing their application in complex, data-driven environments. Our benchmark implementation and results are available at https://github.com/Relaxed-System-Lab/TQA-Bench.
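To make the "flexible sampling mechanism" concrete, the following is a minimal, hypothetical sketch of how rows could be drawn from several relational tables until the serialized context approaches a target token budget (e.g., the 8K to 64K range mentioned above). Everything here is an assumption for illustration: the function names (`sample_to_budget`, `approx_tokens`, `serialize`), the round-robin sampling strategy, the CSV linearization, and the 4-characters-per-token estimate are not taken from the benchmark's actual implementation.

```python
# Hypothetical sketch of budgeted multi-table sampling (not TQA-Bench's API).
# Rows are drawn round-robin across tables so that no single table dominates
# the serialized context; sampling stops once an approximate token budget
# is reached. Token counting uses a rough 4-characters-per-token heuristic.
import csv
import io
import random


def approx_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4


def serialize(name: str, header: list[str], rows: list[list[str]]) -> str:
    """Serialize one table as a named CSV block (a simple linearization)."""
    buf = io.StringIO()
    buf.write(f"Table: {name}\n")
    writer = csv.writer(buf)
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


def sample_to_budget(tables, budget_tokens: int, seed: int = 0) -> str:
    """Sample rows across tables until the budget is nearly used.

    `tables` maps table name -> (header, rows). Re-serializing on every
    step is O(n^2) and fine for a sketch, not for production use.
    """
    rng = random.Random(seed)
    pools = {n: rng.sample(rows, len(rows)) for n, (_, rows) in tables.items()}
    kept = {n: [] for n in tables}
    while any(pools.values()):
        for name, pool in pools.items():
            if not pool:
                continue
            kept[name].append(pool.pop())
            context = "\n".join(
                serialize(n, tables[n][0], kept[n]) for n in tables
            )
            if approx_tokens(context) >= budget_tokens:
                return context
    return "\n".join(serialize(n, tables[n][0], kept[n]) for n in tables)


# Example: build a ~8K-token context from two toy tables sharing a join key.
tables = {
    "orders": (["order_id", "user_id", "amount"],
               [[str(i), str(i % 100), str(i * 3)] for i in range(5000)]),
    "users": (["user_id", "country"],
              [[str(i), "US"] for i in range(100)]),
}
context = sample_to_budget(tables, budget_tokens=8_000)
```

Sampling from all tables jointly (rather than truncating one table) is one plausible way to preserve cross-table join structure at every budget level, which is what multi-table QA tasks would need; the benchmark's actual mechanism may differ.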