Recent advancements in language models (LMs) have notably enhanced their ability to reason with tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, leading to scalability challenges due to positional bias or context-length constraints. In response to these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework specifically designed for LM-based table understanding. TableRAG leverages query expansion combined with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We have developed two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
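To make the retrieval pipeline concrete, the sketch below illustrates the general shape of the approach described above: expanded sub-queries are matched against column names (schema retrieval) and distinct cell values (cell retrieval), and only the top hits are assembled into a compact prompt instead of serializing the whole table. This is a minimal illustrative sketch, not the paper's implementation; the word-overlap scoring function, the top-k sizes, and the hand-written sub-queries stand in for the embedding-based retriever and LM-driven query expansion.

```python
def score(query: str, text: str) -> float:
    """Toy relevance score: Jaccard word overlap between query and candidate.
    A real system would use embedding similarity instead."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q | t) or 1)

def top_k(query: str, candidates: list[str], k: int) -> list[str]:
    """Return the k best-scoring candidates with nonzero relevance."""
    ranked = sorted((c for c in candidates if score(query, c) > 0),
                    key=lambda c: -score(query, c))
    return ranked[:k]

def table_rag_prompt(question: str, sub_queries: list[str],
                     columns: list[str], cells: list[str], k: int = 2) -> str:
    """Retrieve top-k schema entries and cell values per expanded sub-query,
    then assemble a compact prompt rather than encoding the entire table."""
    schema_hits, cell_hits = set(), set()
    for sq in sub_queries:
        schema_hits.update(top_k(sq, columns, k))   # schema retrieval
        cell_hits.update(top_k(sq, cells, k))       # cell retrieval
    return (f"Question: {question}\n"
            f"Relevant columns: {sorted(schema_hits)}\n"
            f"Relevant cells: {sorted(cell_hits)}")

# Hypothetical table metadata for illustration only.
columns = ["order_id", "customer name", "total price", "ship date"]
cells = ["total price: 19.99", "customer name: Alice", "ship date: 2021-03-04"]
print(table_rag_prompt(
    "What is the total price for Alice?",
    ["total price", "customer name Alice"],  # expanded sub-queries
    columns, cells))
```

Note that irrelevant columns and cells (e.g. the ship date here) are never placed in the prompt, which is what keeps the prompt length bounded as tables grow.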