Aggregating experimental data from papers enables materials scientists to build better property prediction models and to facilitate scientific discovery. Recently, interest has grown in extracting not only single material properties but also entire experimental measurements. To support this shift, we introduce LitXBench, a framework for benchmarking methods that extract experiments from literature. We also present LitXAlloy, a dense benchmark comprising 1426 total measurements from 19 alloy papers. By storing the benchmark's entries as Python objects, rather than text-based formats such as CSV or JSON, we improve auditability and enable programmatic data validation. We find that frontier language models, such as Gemini 3.1 Pro Preview, outperform existing multi-turn extraction pipelines by up to 0.37 F1. Our results suggest that this performance gap arises because extraction pipelines associate measurements with compositions rather than the processing steps that define a material.
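As a minimal sketch of what storing entries as Python objects could look like, the example below attaches each measurement to a material defined by its composition and its processing steps, and validates entries at construction time. The class names (`ProcessStep`, `Material`, `Measurement`), their fields, and all numeric values are hypothetical illustrations, not the actual LitXBench schema or data from LitXAlloy.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ProcessStep:
    """One processing step, e.g. arc melting or annealing (hypothetical schema)."""
    name: str
    temperature_c: Optional[float] = None
    duration_h: Optional[float] = None

    def __post_init__(self):
        # Programmatic validation: reject physically meaningless durations.
        if self.duration_h is not None and self.duration_h <= 0:
            raise ValueError(f"{self.name}: duration must be positive")


@dataclass(frozen=True)
class Material:
    """A material is its composition plus the steps that produced it."""
    composition: Dict[str, float]  # element symbol -> atomic fraction
    steps: Tuple[ProcessStep, ...] = ()

    def __post_init__(self):
        # Programmatic validation: atomic fractions must sum to 1.
        total = sum(self.composition.values())
        if abs(total - 1.0) > 1e-6:
            raise ValueError(f"atomic fractions sum to {total}, expected 1.0")


@dataclass(frozen=True)
class Measurement:
    """A measured property tied to a fully specified material."""
    material: Material
    prop: str
    value: float
    unit: str


# Illustrative values only: one composition, two processing routes.
comp = {"Co": 0.2, "Cr": 0.2, "Fe": 0.2, "Mn": 0.2, "Ni": 0.2}
as_cast = Material(comp, (ProcessStep("arc melting"),))
annealed = Material(
    comp,
    (
        ProcessStep("arc melting"),
        ProcessStep("annealing", temperature_c=1000.0, duration_h=24.0),
    ),
)

# Same composition, but the two routes are distinct materials.
same_composition = as_cast.composition == annealed.composition
same_material = as_cast == annealed
```

In this sketch, `same_composition` is true while `same_material` is false, which illustrates why keying measurements on composition alone would merge samples that differ in processing. A malformed entry, such as a composition whose fractions do not sum to 1, raises a `ValueError` as soon as the object is built, which is one way the validation described above might work in practice.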