Much of human knowledge in cybersecurity is contained in the ever-growing volume of scientific papers. As this textual data continues to expand, methods for organizing documents become increasingly important for extracting actionable insights hidden within large text datasets. Knowledge Graphs (KGs) store factual information in a structured manner, providing explicit, interpretable knowledge that includes domain-specific information from the cybersecurity scientific literature. One of the challenges in constructing a KG from scientific literature is extracting an ontology from unstructured text. In this paper, we address this challenge and introduce a method for building a multi-modal KG by extracting a structured ontology from scientific papers. We demonstrate the method in the cybersecurity domain. One modality of the KG represents observable information from the papers, such as their authors or the categories in which they were published. The second modality captures latent (hidden) patterns in the text, such as named entities, topics or clusters, and keywords, extracted through hierarchical and semantic non-negative matrix factorization (NMF). We illustrate this concept by using hierarchical and semantic NMF to narrow the more than two million scientific papers uploaded to arXiv down to the cyber domain, and by building a cyber-domain-specific KG from the result.