The automatic extraction of information from Cyber Threat Intelligence (CTI) reports is crucial in risk management. The increasing frequency with which these reports are published has led researchers to develop new systems for automatically extracting different types of entities and relations from textual data. Most state-of-the-art models leverage Natural Language Processing (NLP) techniques, which perform well when extracting a few types of entities at a time but cannot detect heterogeneous data or their relations. Furthermore, several paradigms, such as STIX, have become de facto standards in the CTI community and dictate a formal categorization of entities and relations that enables organizations to share data consistently. This paper presents STIXnet, the first solution for the automated extraction of all STIX entities and relationships from CTI reports. By combining NLP techniques with an interactive Knowledge Base (KB) of entities, our approach obtains F1 scores comparable to state-of-the-art models for entity extraction (0.916) and relation extraction (0.724) while considering significantly more types of entities and relations. Moreover, STIXnet is a modular and extensible framework that manages and coordinates different modules and merges their contributions uniquely and exhaustively. With our approach, researchers and organizations can extend their Information Extraction (IE) capabilities by integrating the efforts of several techniques without needing to develop new tools from scratch.