Document-based question answering (QA) increasingly involves abstract questions, which require synthesizing information scattered within long documents or across multiple documents into a coherent answer. Existing benchmarks and evaluation methods support this setting poorly: they often lack stable abstract references, or they rely on coarse similarity metrics and unstable head-to-head comparisons. To address this gap, we introduce ASTRA-QA, a benchmark for AbSTRAct Question Answering over documents. ASTRA-QA contains 869 QA instances over academic papers and news documents, covering five abstract question types and three controlled retrieval scopes. Each instance carries explicit evaluation annotations: an answer topic set, curated unsupported topics, and aligned evidence. Building on these annotations, ASTRA-QA directly scores topic coverage and the presence of curated unsupported content, thereby assessing whether an answer covers the required key points and avoids unsupported claims; this enables scalable evaluation without exhaustive head-to-head comparisons. Experiments with representative Retrieval-Augmented Generation (RAG) methods spanning vanilla, graph-based, and hierarchical retrieval settings show that ASTRA-QA provides reference-grounded diagnostics for coverage, hallucination, and retrieval-scope robustness. Our dataset and code are available at https://xinyangsally.github.io/astra-benchmark.
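The annotation-based scoring described above can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so everything here is an assumption made for illustration: the `Instance` container, the `score` function, the ratio-based definitions of coverage and unsupported rate, and the idea that an upstream matching step (not shown) has already decided which annotated topics an answer mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical container for one annotated QA instance."""
    answer_topics: frozenset[str]       # key points a good answer should cover
    unsupported_topics: frozenset[str]  # curated topics the documents do not support


def score(instance: Instance, mentioned: set[str]) -> tuple[float, float]:
    """Return (coverage, unsupported_rate) for the topics an answer mentions.

    `mentioned` is assumed to come from an upstream matching step (not shown).
    Both ratios are illustrative definitions, not the benchmark's own metrics.
    """
    covered = instance.answer_topics & mentioned
    flagged = instance.unsupported_topics & mentioned
    coverage = (
        len(covered) / len(instance.answer_topics)
        if instance.answer_topics else 0.0
    )
    unsupported_rate = (
        len(flagged) / len(instance.unsupported_topics)
        if instance.unsupported_topics else 0.0
    )
    return coverage, unsupported_rate


# Toy example with made-up topic labels.
example = Instance(
    answer_topics=frozenset({"t1", "t2", "t3", "t4"}),
    unsupported_topics=frozenset({"u1", "u2"}),
)
coverage, unsupported_rate = score(example, {"t1", "t2", "t3", "u1", "other"})
print(coverage, unsupported_rate)  # 0.75 0.5
```

Because each answer is scored against fixed per-instance annotations, adding a new system under this scheme requires only one scoring pass over its answers rather than comparisons against every other system.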