LitXBench: A Benchmark for Extracting Experiments from Scientific Literature

Aggregating experimental data from papers enables materials scientists to build better property prediction models and to facilitate scientific discovery. Recently, interest has grown in extracting not only single material properties but also entire experimental measurements. To support this shift, we introduce LitXBench, a framework for benchmarking methods that extract experiments from literature. We also present LitXAlloy, a dense benchmark comprising 1426 total measurements from 19 alloy papers. By storing the benchmark's entries as Python objects, rather than text-based formats such as CSV or JSON, we improve auditability and enable programmatic data validation. We find that frontier language models, such as Gemini 3.1 Pro Preview, outperform existing multi-turn extraction pipelines by up to 0.37 F1. Our results suggest that this performance gap arises because extraction pipelines associate measurements with compositions rather than the processing steps that define a material.
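To illustrate why Python objects can make a benchmark easier to audit and validate than CSV or JSON rows, here is a minimal sketch. It is not LitXBench's actual schema: the class names `Material` and `Measurement`, their fields, and the alloy values are all hypothetical and invented for illustration. The sketch also shows the distinction the abstract draws: a measurement is attached to a material defined by its processing steps, not to a bare composition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Material:
    """Hypothetical record: a composition plus its ordered processing steps."""
    composition: dict[str, float]        # element -> atomic fraction
    processing: tuple[str, ...] = ()     # e.g. ("arc melting", "anneal")

    def __post_init__(self):
        # Validation runs at construction time, so a bad entry fails loudly.
        total = sum(self.composition.values())
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"atomic fractions sum to {total}, expected 1.0")


@dataclass(frozen=True)
class Measurement:
    """Hypothetical measurement tied to a material, not to a bare composition."""
    material: Material
    prop: str
    value: float
    unit: str

    def __post_init__(self):
        if not self.unit:
            raise ValueError("unit must not be empty")


# Illustrative, made-up values: same composition, different processing.
as_cast = Material({"Co": 0.5, "Ni": 0.5}, ("arc melting",))
annealed = Material({"Co": 0.5, "Ni": 0.5}, ("arc melting", "anneal"))

entries = [
    Measurement(as_cast, "hardness", 150.0, "HV"),
    Measurement(annealed, "hardness", 120.0, "HV"),
]
```

Because the two materials differ only in `processing`, keying measurements on composition alone would merge them into one record, while the object comparison `as_cast != annealed` keeps them distinct.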

