Language models, potentially augmented with tools such as retrieval, are becoming the go-to means of answering questions. Understanding and answering questions in real-world settings often requires retrieving information from different sources, processing and aggregating data to extract insights, and presenting complex findings in the form of structured artifacts such as novel tables, charts, or infographics. In this paper, we introduce TANQ, the first open-domain question answering dataset where the answers require building tables from information across multiple sources. We release full source attribution for every cell in the resulting tables and benchmark state-of-the-art language models in open, oracle, and closed-book setups. Our best-performing baseline, GPT4, reaches an overall F1 score of 29.1, lagging behind human performance by 19.7 points. We analyse the baselines' performance across dataset attributes such as the skills the task requires, including multi-hop reasoning, math operations, and unit conversions. We further discuss common failure modes in model-generated answers, suggesting that TANQ is a complex task with many challenges ahead.
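To make the table-level F1 evaluation concrete, the sketch below scores a predicted table against a gold table as an F1 over their cell values. This is a toy illustration, not the paper's actual evaluation code: the choice of exact string matching over cell sets is an assumption, and TANQ's metric may weigh cells, rows, or attributions differently.

```python
def cell_f1(pred_cells, gold_cells):
    """F1 over sets of table cells, using exact string match.

    Precision is the fraction of predicted cells that appear in the
    gold table; recall is the fraction of gold cells recovered.
    """
    pred, gold = set(pred_cells), set(gold_cells)
    if not pred and not gold:
        return 1.0  # both tables empty: trivially perfect
    tp = len(pred & gold)  # cells present in both tables
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)


# A predicted table sharing half its cells with the gold table:
score = cell_f1(["Paris", "2.1M"], ["Paris", "2.2M"])  # tp=1, P=R=0.5 -> F1=0.5
```

Under a metric of this shape, GPT4's overall score of 29.1 corresponds to recovering well under half of the gold table content on average, which is consistent with the reported 19.7-point gap to human performance.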