Advanced omics technologies and facilities generate a wealth of valuable data daily; however, these data often lack the essential metadata that researchers need to find and search them effectively. This lack of metadata is a significant obstacle to using the datasets. Machine learning (ML)-based metadata extraction has emerged as a potentially viable way to automatically annotate scientific datasets with the metadata needed for effective search. Text labeling, usually performed manually, plays a crucial role in validating machine-extracted metadata. However, manual labeling is time-consuming, so there is a need for automated text labeling techniques that can accelerate scientific innovation. This need is particularly urgent in fields such as environmental genomics and microbiome science, which have historically received less attention in metadata curation and in the creation of gold-standard text mining datasets. In this paper, we present two novel automated text labeling approaches for validating ML-generated metadata for unlabeled texts, with specific applications in environmental genomics. Each approach leverages a different kind of existing information about the unlabeled texts and the scientific domain. The first exploits relationships between different types of data sources related to the same research study, such as publications and proposals. The second takes advantage of domain-specific controlled vocabularies or ontologies. We detail how these approaches are applied to the validation of ML-generated metadata. Our results show that the proposed label assignment approaches can generate both generic and highly specific text labels for the unlabeled texts, with up to 44% of the labels matching those suggested by an ML keyword extraction algorithm.
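To make the second approach and the label-matching measure more concrete, the following is a minimal Python sketch. It is not the paper's implementation: the function names `assign_vocabulary_labels` and `label_match_rate`, the whole-phrase matching rule, and the toy vocabulary and text are all assumptions introduced purely for illustration.

```python
import re


def assign_vocabulary_labels(text, vocabulary):
    """Return the vocabulary terms that occur in `text` as whole words or
    phrases (case-insensitive). Hypothetical helper, not from the paper."""
    lowered = text.lower()
    labels = set()
    for term in vocabulary:
        # Word boundaries prevent "soil" from matching inside "soiled".
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        if re.search(pattern, lowered):
            labels.add(term.lower())
    return labels


def label_match_rate(labels, ml_keywords):
    """Fraction of assigned labels that also appear among the keywords
    suggested by an ML extractor (exact, case-insensitive match)."""
    if not labels:
        return 0.0
    normalized = {keyword.lower() for keyword in ml_keywords}
    return len(labels & normalized) / len(labels)


# Toy inputs (illustrative only, not data from the paper).
vocabulary = ["soil", "metagenome", "rhizosphere", "marine sediment"]
text = "We sequenced the soil metagenome of the rhizosphere of switchgrass."
ml_keywords = ["metagenome", "switchgrass"]

labels = assign_vocabulary_labels(text, vocabulary)
rate = label_match_rate(labels, ml_keywords)
print(sorted(labels), round(rate, 2))
```

In this toy case three vocabulary terms are assigned as labels and one of them coincides with an ML-suggested keyword. A real pipeline would likely need synonym handling and ontology-aware matching rather than exact string comparison.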