SPARTA: Scalable and Principled Benchmark of Tree-Structured Multi-hop QA over Text and Tables

Real-world Table-Text question answering (QA) tasks require models that can reason across long text and source tables, traversing multiple hops and executing complex operations such as aggregation. Yet existing benchmarks are small, manually curated (and therefore error-prone), and contain shallow questions that seldom demand more than two hops or invoke aggregations, grouping, or other advanced analytical operations expressible in natural-language queries. We present SPARTA, an end-to-end construction framework that automatically generates large-scale Table-Text QA benchmarks with lightweight human validation, requiring only one quarter of the annotation time of HybridQA. The framework first constructs a reference fact database by enriching each source table with grounding tables whose tuples are atomic facts automatically extracted from the accompanying unstructured passages, then synthesizes nested queries whose number of nested predicates matches the desired hop count. To ensure that every SQL statement is executable and that its verbalization yields a fluent, human-sounding question, we propose two novel techniques: provenance-based refinement, which rewrites any syntactically valid query that returns an empty result, and realistic-structure enforcement, which confines generation to post-order traversals of the query graph. The resulting pipeline produces thousands of high-fidelity question-answer pairs covering aggregations, grouping, and deep multi-hop reasoning across text and tables. On SPARTA, state-of-the-art models that reach over 70 F1 on HybridQA or over 50 F1 on OTT-QA drop by more than 30 F1 points, exposing fundamental weaknesses in current cross-modal reasoning. Our benchmark, construction code, and baseline models are available at https://github.com/pshlego/SPARTA/tree/main.
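
As a rough illustration of how the number of nested predicates can track the hop count, the following minimal sketch runs a two-hop nested SQL query over a toy database. The schema (`team`, `player`, `game_fact`), the data, and the helpers `run` and `hop_count` are hypothetical and are not taken from the SPARTA code; `game_fact` merely stands in for a grounding table whose tuples would be extracted from passages. The final check only detects an empty result and does not implement provenance-based refinement.

```python
import sqlite3

# Toy reference fact database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE team (team TEXT, city TEXT);           -- source table
CREATE TABLE player (name TEXT, team TEXT);         -- source table
CREATE TABLE game_fact (name TEXT, points INTEGER); -- stand-in grounding table
INSERT INTO team VALUES ('Hawks', 'Atlanta'), ('Bulls', 'Chicago');
INSERT INTO player VALUES ('Ann', 'Hawks'), ('Bob', 'Bulls'), ('Cy', 'Bulls');
INSERT INTO game_fact VALUES ('Ann', 12), ('Bob', 35), ('Cy', 41);
""")

# Two nested IN-predicates: game_fact -> player -> team, i.e. two hops.
TWO_HOP_SQL = """
SELECT city FROM team
WHERE team IN (
    SELECT team FROM player
    WHERE name IN (
        SELECT name FROM game_fact WHERE points > ?
    )
)
"""

def run(min_points):
    """Execute the nested query and return the matching cities."""
    return [row[0] for row in conn.execute(TWO_HOP_SQL, (min_points,))]

def hop_count(sql):
    """Count nested IN-subqueries as a crude proxy for the hop count."""
    return sql.upper().count(" IN (")

print(hop_count(TWO_HOP_SQL))  # number of nested predicates
print(run(30))                 # cities whose teams have a player above 30 points
print(run(100) == [])          # an empty result would need rewriting
```

A query such as `run(100)` is syntactically valid but returns nothing, which is the kind of case the abstract says refinement is meant to handle; how the rewrite is actually performed is not specified here.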
