Literature review tables are essential for summarizing and comparing collections of scientific papers. In this paper, we study the automatic generation of such tables from a pool of papers to satisfy a user's information need. Building on recent work (Newman et al., 2024), we move beyond oracle settings by (i) simulating well-specified yet schema-agnostic user demands that avoid leaking gold column names or values, (ii) explicitly modeling retrieval noise via semantically related but out-of-scope distractor papers verified by human annotators, and (iii) introducing a lightweight, annotation-free, utilization-oriented evaluation that decomposes utility into schema coverage, unary cell fidelity, and pairwise relational consistency, while measuring paper selection through a two-way QA procedure (gold-to-system and system-to-gold) with recall, precision, and F1. To support reproducible evaluation, we introduce arXiv2Table, a benchmark of 1,957 tables referencing 7,158 papers, with human-verified distractors and rewritten, schema-agnostic user demands. We also develop an iterative, batch-based generation method that co-refines paper filtering and schema over multiple rounds. We validate the evaluation protocol with human audits and cross-evaluator checks. Extensive experiments show that our method consistently improves over strong baselines, although absolute scores remain modest, underscoring the task's difficulty. Our data and code are available at https://github.com/JHU-CLSP/arXiv2Table.
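To make the paper-selection metrics concrete, the following is a minimal sketch of how precision, recall, and F1 relate once it is known which papers a system kept. It is an illustration under simplifying assumptions, not the evaluation protocol of this work: it compares paper identifiers by exact set overlap, whereas the abstract describes a two-way QA procedure. The function name `selection_scores` and the paper IDs are hypothetical.

```python
from typing import Iterable


def selection_scores(gold_ids: Iterable[str], system_ids: Iterable[str]) -> dict[str, float]:
    """Precision, recall, and F1 of a system's selected papers against a gold set.

    Assumption: papers are matched by exact ID overlap. This stands in for
    the QA-based matching described in the abstract and is not equivalent to it.
    """
    gold, system = set(gold_ids), set(system_ids)
    overlap = len(gold & system)

    # Precision: fraction of selected papers that belong in the table.
    precision = overlap / len(system) if system else 0.0
    # Recall: fraction of gold papers that the system kept.
    recall = overlap / len(gold) if gold else 0.0
    # F1: harmonic mean, defined as 0 when there is no overlap.
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0

    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: three gold papers; the system keeps two of them
# and lets one distractor ("d1") through.
scores = selection_scores(["p1", "p2", "p3"], ["p1", "p2", "d1"])
```

In this toy case each admitted distractor lowers precision and each dropped gold paper lowers recall, which is why the two are reported together with F1.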