Tabular data is frequently captured in image form across a wide range of real-world scenarios such as financial reports, handwritten records, and document scans. These visual representations pose unique challenges for machine understanding, as they combine both structural and visual complexities. While recent advances in Multimodal Large Language Models (MLLMs) show promising results in table understanding, they typically assume the relevant table is readily available. However, a more practical scenario involves identifying and reasoning over relevant tables from large-scale collections to answer user queries. To address this gap, we propose TabRAG, a framework that enables MLLMs to answer queries over large collections of table images. Our approach first retrieves candidate tables using jointly trained visual-text foundation models, then leverages MLLMs to perform fine-grained reranking of these candidates, and finally employs MLLMs to reason over the selected tables for answer generation. Through extensive experiments on a newly constructed dataset comprising 88,161 training and 9,819 testing samples across 8 benchmarks with 48,504 unique tables, we demonstrate that our framework significantly outperforms existing methods by 7.0% in retrieval recall and 6.1% in answer accuracy, offering a practical solution for real-world table understanding tasks.
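The retrieve-then-rerank-then-reason pipeline described above can be sketched in miniature. This is a hypothetical illustration, not the paper's implementation: `embed` stands in for the jointly trained visual-text foundation model, the token-overlap `rerank` stands in for the MLLM reranker, and `answer` is a placeholder for MLLM reasoning; all function names and the toy data are assumptions.

```python
import re

def embed(text):
    # Stand-in embedding: normalized character-frequency vector. A real
    # system would encode table images with the visual-text encoder.
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - ord('a')] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    return sum(x * y for x, y in zip(a, b))

def retrieve(query, tables, k=2):
    # Stage 1: coarse retrieval of top-k candidate tables by embedding similarity.
    q = embed(query)
    return sorted(tables, key=lambda t: cosine(q, embed(t)), reverse=True)[:k]

def tokens(s):
    return set(re.findall(r"[a-z0-9]+", s.lower()))

def rerank(query, candidates):
    # Stage 2: fine-grained reranking; token overlap is a crude proxy
    # for an MLLM relevance score over the query-table pair.
    q = tokens(query)
    return sorted(candidates, key=lambda t: len(q & tokens(t)), reverse=True)

def answer(query, table):
    # Stage 3: placeholder for MLLM reasoning over the selected table image.
    return f"Answer derived from: {table}"

tables = [
    "quarterly revenue report 2023",
    "employee attendance sheet",
    "handwritten expense record",
]
query = "What was the 2023 revenue?"
best = rerank(query, retrieve(query, tables))[0]
print(answer(query, best))
```

In a full system each stage would batch over tens of thousands of table images, which is why the cheap retrieval stage precedes the expensive MLLM reranking.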