In this paper, we study numerical multi-table question answering (MTQA) over large-scale table collections (e.g., online data repositories), a task that is essential in many analytical applications. Existing MTQA solutions, such as text-to-SQL or open-domain MTQA methods, are designed for databases and struggle when applied to large-scale table collections. Their key limitations are: (1) limited support for complex table relationships; (2) ineffective retrieval of relevant tables at scale; and (3) inaccurate answer generation. To overcome these limitations, we propose DMRAL, a Decomposition-driven Multi-table Retrieval and Answering framework for MTQA over large-scale table collections. DMRAL consists of: (1) a table relationship graph that captures complex relationships among tables; (2) a Table-Aligned Question Decomposer and a Coverage-Aware Retriever, which jointly identify relevant tables in large-scale corpora by improving question decomposition quality and maximizing the question coverage of the retrieved tables; and (3) a Sub-question Guided Reasoner, which produces correct answers by progressively generating and refining a reasoning program based on sub-questions. Experiments on two MTQA datasets demonstrate that DMRAL significantly outperforms existing state-of-the-art MTQA methods, with average improvements of 24% in table retrieval and 55% in answer accuracy.
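To make the notion of "maximizing the question coverage of retrieved tables" concrete, the following is a minimal illustrative sketch and not the DMRAL implementation. It assumes that a decomposer has already produced sub-questions and that some upstream matcher has determined which sub-questions each candidate table can address. The function name `greedy_coverage_retrieve`, the `candidates` mapping, and the greedy set-cover heuristic are all assumptions introduced here for illustration; the abstract does not specify the actual retrieval objective.

```python
from typing import Dict, List, Set


def greedy_coverage_retrieve(
    sub_questions: List[str],
    candidates: Dict[str, Set[str]],
    max_tables: int,
) -> List[str]:
    """Greedily select tables that cover the most still-uncovered sub-questions.

    `candidates` maps a (hypothetical) table name to the set of sub-questions
    it is assumed to be able to address.
    """
    uncovered = set(sub_questions)
    selected: List[str] = []
    while uncovered and len(selected) < max_tables:
        # Iterate in sorted order so ties go to the alphabetically first table.
        remaining = (t for t in sorted(candidates) if t not in selected)
        best = max(
            remaining,
            key=lambda t: len(candidates[t] & uncovered),
            default=None,
        )
        # Stop when no remaining table covers anything new.
        if best is None or not (candidates[best] & uncovered):
            break
        selected.append(best)
        uncovered -= candidates[best]
    return selected


# Toy example with made-up table names and sub-question identifiers.
example_sub_questions = ["q1", "q2", "q3"]
example_candidates = {
    "sales": {"q1", "q2"},
    "regions": {"q2"},
    "population": {"q3"},
}
example_result = greedy_coverage_retrieve(
    example_sub_questions, example_candidates, max_tables=2
)
print(example_result)  # ['sales', 'population']
```

In this toy example, `regions` is skipped because its only sub-question is already covered by `sales`. A purely similarity-ranked retriever could return both and miss `population`, which is the kind of failure a coverage-oriented objective is meant to avoid.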