We propose a semi-automatic staging area for efficiently building an accurate database of experimental physical properties of superconductors from the literature, called SuperCon2, to enrich the existing manually built superconductor database SuperCon. Here we report our curation interface (SuperCon2 Interface) and a workflow that manages the state transitions of each examined record, in order to validate the dataset of superconductors extracted from PDF documents using Grobid-superconductors in a previous work. The curation workflow supports both automatic and manual operations. The automatic operations comprise ``anomaly detection'', which scans new data to identify outliers, and a ``training data collector'' mechanism, which collects training examples based on manual corrections. This training data collection policy is effective in improving the machine-learning models with a reduced number of examples. For manual operations, the SuperCon2 Interface is designed to increase the efficiency of manual correction by providing a smart interface and an enhanced PDF document viewer. We show that our interface significantly improves curation quality, boosting precision and recall compared with traditional ``manual correction''. Our semi-automatic approach provides a solution for building a reliable database through text-data mining of scientific documents.