Automatic extraction of information from publications is key to making scientific knowledge machine-readable at a large scale. The extracted information can, for example, facilitate academic search, decision making, and knowledge graph construction. An important type of information not covered by existing approaches is hyperparameters. In this paper, we formalize and tackle hyperparameter information extraction (HyperPIE) as an entity recognition and relation extraction task. We create a labeled data set covering publications from a variety of computer science disciplines. Using this data set, we train and evaluate BERT-based fine-tuned models as well as five large language models: GPT-3.5, GALACTICA, Falcon, Vicuna, and WizardLM. For fine-tuned models, we develop a relation extraction approach that achieves an improvement of 29% F1 over a state-of-the-art baseline. For large language models, we develop an approach leveraging YAML output for structured data extraction, which achieves an average improvement of 5.5% F1 in entity recognition over using JSON. With our best-performing model, we extract hyperparameter information from a large number of unannotated papers and analyze patterns across disciplines. All our data and source code are publicly available at https://github.com/IllDepence/hyperpie