Retrieval-Augmented Generation (RAG) has demonstrated considerable effectiveness in open-domain question answering. However, when applied to heterogeneous documents that comprise both textual and tabular components, existing RAG approaches exhibit critical limitations. The prevailing practice of flattening tables and splitting them into chunks disrupts the intrinsic tabular structure, leads to information loss, and undermines the reasoning capabilities of LLMs on multi-hop, global queries. To address these challenges, we propose TableRAG, an SQL-based framework that unifies textual understanding and complex manipulations over tabular data. TableRAG operates iteratively in four steps: context-sensitive query decomposition, text retrieval, SQL programming and execution, and compositional intermediate answer generation. We also develop HeteQA, a novel benchmark designed to evaluate multi-hop heterogeneous reasoning capabilities. Experimental results demonstrate that TableRAG consistently outperforms existing baselines on both public datasets and our HeteQA, establishing a new state-of-the-art for heterogeneous document question answering. We release TableRAG at https://github.com/yxh-y/TableRAG/tree/main.
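To make the contrast between chunked, flattened tables and SQL execution concrete, the following is a minimal sketch and not the TableRAG implementation. The toy table, the column names, and the helper functions `flatten_and_chunk` and `total_revenue_by_sql` are all hypothetical, introduced only for illustration. It shows a global aggregate query whose evidence is scattered across every chunk after flattening, while a single SQL statement over the intact table answers it directly.

```python
import sqlite3

# Hypothetical toy table (name, region, revenue); not taken from the paper or HeteQA.
ROWS = [
    ("Alpha", "Europe", 120),
    ("Beta", "Asia", 340),
    ("Gamma", "Europe", 95),
    ("Delta", "Asia", 210),
    ("Epsilon", "Europe", 180),
]


def flatten_and_chunk(rows, rows_per_chunk=2):
    """Flatten each row to text and group the rows into fixed-size chunks."""
    lines = [f"company: {n}; region: {r}; revenue: {v}" for n, r, v in rows]
    return [
        " | ".join(lines[i:i + rows_per_chunk])
        for i in range(0, len(lines), rows_per_chunk)
    ]


def total_revenue_by_sql(rows, region):
    """Answer a global aggregate query by running SQL over the whole table."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE companies (name TEXT, region TEXT, revenue INTEGER)"
        )
        conn.executemany("INSERT INTO companies VALUES (?, ?, ?)", rows)
        (total,) = conn.execute(
            "SELECT SUM(revenue) FROM companies WHERE region = ?", (region,)
        ).fetchone()
        return total
    finally:
        conn.close()


chunks = flatten_and_chunk(ROWS)

# Each chunk mentions only one European row, so a retriever that returns a
# single top-ranked chunk cannot see all the rows the aggregate depends on.
europe_chunks = [c for c in chunks if "region: Europe" in c]

# The SQL path sees the full table and computes the aggregate exactly.
europe_total = total_revenue_by_sql(ROWS, "Europe")
```

In this toy setting the European rows land in three different chunks, so answering "what is the total revenue in Europe?" from retrieved text would require retrieving and combining every chunk, whereas the SQL path does not depend on chunk boundaries at all.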