Semi-structured data, such as Infobox tables, often include temporal information about entities, stated either implicitly or explicitly. Can current NLP systems reason about such information in semi-structured tables? To tackle this question, we introduce the task of temporal question answering on semi-structured tables. We present a dataset, TempTabQA, which comprises 11,454 question-answer pairs extracted from 1,208 Wikipedia Infobox tables spanning more than 90 distinct domains. Using this dataset, we evaluate the temporal reasoning of several state-of-the-art models. We observe that even the top-performing LLMs lag behind human performance by more than 13.5 F1 points. Given these results, our dataset has the potential to serve as a challenging benchmark for improving the temporal reasoning capabilities of NLP models.