In this study, we propose a staging area for ingesting into SuperCon new experimental data on superconductors that is machine-collected from scientific articles. Our objective is to make updating SuperCon more efficient while maintaining or improving data quality. The staging area is semi-automatic, driven by a workflow that combines automatic and manual processes applied to the extracted database. An automatic anomaly detection process pre-screens the collected data. Users can then manually correct any errors through a user interface designed to simplify verifying the data against the original PDF documents. Additionally, when a record is corrected, its raw data is collected and used as training data to improve the machine learning models. Evaluation experiments demonstrate that our staging area significantly improves curation quality. We compare the interface with the traditional manual approach of reading PDF documents and recording information in an Excel document. Using the interface boosts precision and recall by 6% and 50%, respectively, for an average increase of 40% in F1-score.
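The semi-automatic workflow summarised in the abstract (automatic pre-screening, manual correction, and collection of corrected records as training data) might be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Record`, `StagingArea`, `prescreen`, and `correct`, the status values, and the simple temperature-range rule standing in for anomaly detection are all hypothetical, and the 3900 K entry is a made-up extraction error.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    NEW = "new"              # machine-extracted, not yet reviewed
    FLAGGED = "flagged"      # marked by the automatic pre-screening step
    CORRECTED = "corrected"  # fixed manually by a curator


@dataclass
class Record:
    material: str
    tc_kelvin: float         # extracted critical temperature
    status: Status = Status.NEW


@dataclass
class StagingArea:
    records: list = field(default_factory=list)
    training_data: list = field(default_factory=list)

    def prescreen(self, max_tc=300.0):
        """Flag records with an implausible critical temperature.

        Hypothetical stand-in for the anomaly detection process.
        """
        for r in self.records:
            if r.status is Status.NEW and not (0.0 < r.tc_kelvin <= max_tc):
                r.status = Status.FLAGGED

    def correct(self, index, **changes):
        """Apply a curator's correction and keep the before/after pair."""
        r = self.records[index]
        before = (r.material, r.tc_kelvin)
        for name, value in changes.items():
            setattr(r, name, value)
        r.status = Status.CORRECTED
        # Corrected records are collected for later model retraining.
        self.training_data.append(
            {"before": before, "after": (r.material, r.tc_kelvin)}
        )


# Illustrative data: the second record simulates a misplaced decimal point.
area = StagingArea(records=[Record("MgB2", 39.0), Record("MgB2", 3900.0)])
area.prescreen()                 # flags the 3900 K record
area.correct(1, tc_kelvin=39.0)  # curator fixes it against the source PDF
```

In this sketch only flagged or corrected records change status, so a curator's attention goes first to the entries the pre-screening step considers suspicious, and every correction leaves behind a before/after pair that could feed model retraining.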