Natural language processing (NLP) is a promising approach for analyzing large volumes of scientific literature on climate change and infrastructure. However, best-practice NLP techniques require a large collection of relevant documents (a corpus). Furthermore, NLP techniques based on supervised machine learning and deep learning require labels that group the articles by user-defined criteria for a significant subset of the corpus in order to train the model. Labeling even a few hundred documents with human subject-matter experts is time-consuming. To expedite this process, we developed a weak supervision-based NLP approach that leverages semantic similarity between categories and documents to (i) establish a topic-specific corpus by subsetting a large-scale open-access corpus and (ii) generate category labels for the topic-specific corpus. Whereas labeling by subject-matter experts takes months, we assign category labels to the whole corpus using weak supervision and supervised learning in about 13 hours. The labeled climate and National Critical Function (NCF) corpus enables targeted, efficient identification of documents discussing a topic (or combination of topics) of interest and identification of various effects of climate change on critical infrastructure, improving the usability of scientific literature and ultimately supporting enhanced policy and decision making. To demonstrate this capability, we conduct topic modeling on pairs of climate hazards and NCFs to discover trending topics at the intersection of these categories. This method helps analysts and decision-makers quickly grasp the relevant topics and the most important documents linked to each topic.
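The similarity-based subsetting and labeling idea can be illustrated with a minimal sketch. It is not the implementation described here: the abstract does not specify the embedding model, the similarity threshold, or the category definitions, so the sketch substitutes plain term-count vectors with cosine similarity for a learned semantic embedding, and the category descriptions, example documents, `weak_label` function, and `threshold` value are all hypothetical. A document whose best similarity score falls below the threshold is left out of the topic-specific corpus; otherwise it receives the best-matching category as a weak label.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_vector(text):
    """Term-count vector; a stand-in (assumption) for a semantic embedding."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def weak_label(documents, categories, threshold=0.1):
    """Keep documents similar to at least one category and label them.

    Returns a list of (document, category, score) tuples. The threshold
    value is an illustrative assumption, not a value from the paper.
    """
    category_vectors = {name: term_vector(desc) for name, desc in categories.items()}
    labeled = []
    for doc in documents:
        doc_vector = term_vector(doc)
        scores = {name: cosine(doc_vector, vec) for name, vec in category_vectors.items()}
        best = max(scores, key=scores.get)
        if scores[best] >= threshold:  # step (i): subset the corpus
            labeled.append((doc, best, scores[best]))  # step (ii): weak label
    return labeled


# Hypothetical category descriptions and documents, for illustration only.
categories = {
    "flood": "flood flooding inundation storm surge",
    "drought": "drought water scarcity dry",
}
documents = [
    "Coastal flooding and storm surge threaten electricity distribution.",
    "Prolonged drought reduces water supply for power plant cooling.",
    "A new recipe for sourdough bread.",
]

labeled = weak_label(documents, categories)
for doc, category, score in labeled:
    print(f"{category:8s} {score:.2f}  {doc}")
```

In this sketch the third document matches neither category and is dropped, while the first two receive weak labels that could then serve as training data for a supervised classifier. Assigning only the single best category is a simplification; a document discussing a combination of topics would need one score threshold per category instead.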