Tabular data embedded in PDF files, web pages, and other types of documents is prevalent in various domains. These tables, which we call human-centric tables (HCTs for short), are dense in information but often exhibit complex structural and semantic layouts. To query these HCTs, some existing solutions focus on transforming them into relational formats. However, these solutions fail to handle the diverse and complex layouts of HCTs, making the tables not amenable to easy querying with SQL-based approaches. Another emerging option is to use Large Language Models (LLMs) and Vision Language Models (VLMs). However, there is a lack of standard evaluation benchmarks to measure and compare the performance of such models in querying HCTs with natural language. To address this gap, we propose the Human-Centric Tables Question-Answering benchmark (HCT-QA), an extensive benchmark consisting of thousands of HCTs paired with tens of thousands of natural language questions and their respective answers. More specifically, HCT-QA includes 1,880 real-world HCTs with 9,835 QA pairs, in addition to 4,679 synthetic HCTs with 67.7K QA pairs. Through extensive experiments, we evaluate the performance of 25 LLMs and 9 VLMs in answering HCT-QA's questions. In addition, we show that fine-tuning an LLM on HCT-QA improves F1 scores by up to 25 percentage points compared to the off-the-shelf model. Compared to existing benchmarks, HCT-QA stands out for the broad complexity and diversity of its covered HCTs and generated questions, its comprehensive metadata enabling deeper insight and analysis, and its novel synthetic data and QA generator.