In this paper, we establish a benchmark for table visual question answering, referred to as TableVQA-Bench, derived from pre-existing table question-answering (QA) and table structure recognition datasets. The existing datasets lack images or QA pairs, the two essential components of TableVQA, so the primary objective of this paper is to obtain these missing components. Specifically, images are sourced either by applying a \textit{stylesheet} or by employing the proposed table rendering system. QA pairs are generated by a large language model (LLM) that takes a text-formatted table as input. The completed TableVQA-Bench comprises 1,500 QA pairs. We comprehensively compare the performance of various multi-modal large language models (MLLMs) on TableVQA-Bench. In our experiments, GPT-4V achieves the highest accuracy among the commercial and open-source MLLMs evaluated. Moreover, we find that the number of vision queries plays a significant role in TableVQA performance. To further analyze the capabilities of MLLMs relative to their LLM backbones, we present image-formatted tables to MLLMs and text-formatted tables to LLMs. Our findings suggest that processing visual inputs is more challenging than processing text inputs: MLLMs perform worse than LLMs even though they generally require higher computational cost. The proposed TableVQA-Bench and evaluation code are available at \href{https://github.com/naver-ai/tablevqabench}{https://github.com/naver-ai/tablevqabench}.