Recent advances in language models (LMs) have notably enhanced their ability to reason over tabular data, primarily through program-aided mechanisms that manipulate and analyze tables. However, these methods often require the entire table as input, which raises scalability challenges due to positional bias and context-length constraints. To address these challenges, we introduce TableRAG, a Retrieval-Augmented Generation (RAG) framework designed specifically for LM-based table understanding. TableRAG combines query expansion with schema and cell retrieval to pinpoint crucial information before providing it to the LMs. This enables more efficient data encoding and precise retrieval, significantly reducing prompt lengths and mitigating information loss. We develop two new million-token benchmarks from the Arcade and BIRD-SQL datasets to thoroughly evaluate TableRAG's effectiveness at scale. Our results demonstrate that TableRAG's retrieval design achieves the highest retrieval quality, leading to new state-of-the-art performance on large-scale table understanding.
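To make the schema-and-cell retrieval idea concrete, the following is a minimal sketch of the pipeline the abstract describes: rather than passing the entire table to the LM, only the top-scoring column names and distinct cell values are retrieved for the prompt. All names here are illustrative, a toy token-overlap score stands in for a real embedding model, and `expand_query` is a hypothetical placeholder for LM-generated sub-queries; this is not the actual TableRAG implementation.

```python
def score(query: str, text: str) -> float:
    """Toy token-overlap similarity (Jaccard); a real system would
    use embedding-based retrieval instead."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q | t) or 1)

def expand_query(question: str) -> list[str]:
    """Hypothetical query expansion: an LM would generate schema- and
    cell-oriented sub-queries; here we simply reuse the question."""
    return [question]

def retrieve(question: str, columns: list[str], cells: list[str], k: int = 2) -> dict:
    """Return the top-k column names and top-k distinct cell values,
    so only this small context (not the full table) reaches the LM."""
    results = {"schema": [], "cells": []}
    for q in expand_query(question):
        results["schema"] += sorted(columns, key=lambda c: -score(q, c))[:k]
        results["cells"] += sorted(cells, key=lambda v: -score(q, v))[:k]
    return results

# Illustrative toy table context
columns = ["country name", "population total", "gdp per capita"]
cells = ["United States", "population total: 331000000", "Japan"]
print(retrieve("what is the population of Japan", columns, cells))
```

The retrieved schema and cell snippets would then be concatenated into a short prompt, keeping the context far below the length of the original table.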