Automated Annotation of Scientific Texts for ML-based Keyphrase Extraction and Validation

Advanced omics technologies and facilities generate a wealth of valuable data daily; however, the data often lack the metadata researchers need to find and search it effectively, which significantly hinders the use of these datasets. Machine learning-based metadata extraction has emerged as a potentially viable approach to automatically annotating scientific datasets with the metadata needed for effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming, so there is a need for automated text labeling techniques that can accelerate scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in metadata curation and in the creation of gold-standard text mining datasets. In this paper, we present two novel automated text labeling approaches for validating ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Our techniques show the potential of two new ways to leverage existing information about the unlabeled texts and the scientific domain. The first technique exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second technique takes advantage of domain-specific controlled vocabularies or ontologies. We detail how these approaches are applied to the validation of ML-generated metadata. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
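To make the second technique and the reported match percentage more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a controlled vocabulary is available as a plain list of terms, assigns as labels the terms that occur in a text, and then measures what fraction of those labels also appear among ML-extracted keyphrases. The function names, the example text, the vocabulary, and the keyphrase list are all hypothetical and chosen only for illustration; the abstract does not specify how matching is actually performed.

```python
import re


def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so comparisons ignore case and spacing."""
    return " ".join(phrase.lower().split())


def labels_from_vocabulary(text: str, vocabulary: list[str]) -> set[str]:
    """Return normalized vocabulary terms that occur in the text as whole words or phrases.

    Assumption: exact whole-word matching; a real pipeline might add stemming
    or ontology synonyms, which this sketch does not attempt.
    """
    haystack = normalize(text)
    found = set()
    for term in vocabulary:
        needle = normalize(term)
        # Lookarounds prevent matching a term inside a longer word.
        pattern = r"(?<!\w)" + re.escape(needle) + r"(?!\w)"
        if re.search(pattern, haystack):
            found.add(needle)
    return found


def match_rate(labels: set[str], ml_keyphrases: list[str]) -> float:
    """Fraction of assigned labels that also appear among the ML-extracted keyphrases."""
    if not labels:
        return 0.0
    extracted = {normalize(k) for k in ml_keyphrases}
    return len(labels & extracted) / len(labels)


# Hypothetical inputs, for illustration only.
text = "Metagenomic sequencing of permafrost soil reveals microbial community shifts."
vocabulary = ["permafrost", "Microbial community", "marine sediment", "soil"]
ml_keyphrases = ["permafrost", "metagenomic sequencing", "Soil"]

labels = labels_from_vocabulary(text, vocabulary)
rate = match_rate(labels, ml_keyphrases)
print(sorted(labels), round(rate, 2))
```

Labels derived from a related document, such as a proposal linked to the same study, could be passed to `match_rate` in the same way, under the same assumption of exact matching after normalization.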
