The automatic extraction of information from Cyber Threat Intelligence (CTI) reports is crucial in risk management. The increasing frequency with which these reports are published has led researchers to develop new systems for automatically extracting different types of entities and relations from textual data. Most state-of-the-art models leverage Natural Language Processing (NLP) techniques, which perform well when extracting a few types of entities at a time but cannot detect heterogeneous data or their relations. Furthermore, several paradigms, such as STIX, have become de facto standards in the CTI community and dictate a formal categorization of different entities and relations to enable organizations to share data consistently. This paper presents STIXnet, the first solution for the automated extraction of all STIX entities and relationships in CTI reports. Through the use of NLP techniques and an interactive Knowledge Base (KB) of entities, our approach obtains F1 scores comparable to state-of-the-art models for entity extraction (0.916) and relation extraction (0.724) while considering significantly more types of entities and relations. Moreover, STIXnet constitutes a modular and extensible framework that manages and coordinates different modules to merge their contributions uniquely and exhaustively. With our approach, researchers and organizations can extend their Information Extraction (IE) capabilities by integrating the efforts of several techniques without needing to develop new tools from scratch.
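As a rough illustration of the modular idea described in the abstract, the following minimal Python sketch shows how several independent extraction modules might be coordinated and their outputs merged without duplicates, together with the standard F1 computation used for evaluation. It is not STIXnet's actual implementation: the `Entity` class, the two toy extractors, the tiny knowledge base, and the `merge` and `f1_score` helpers are all hypothetical names introduced here for illustration. Only the STIX type labels (`threat-actor`, `malware`, `ipv4-addr`) are taken from the STIX vocabulary.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Entity:
    """A text span labelled with a STIX type (hypothetical container)."""
    text: str
    stix_type: str
    start: int
    end: int


# Hypothetical module interface: a module maps report text to entities.
Extractor = Callable[[str], Iterable[Entity]]

# Toy knowledge base, assumed for illustration only.
TOY_KB = {"APT28": "threat-actor", "X-Agent": "malware"}


def kb_extractor(report: str) -> List[Entity]:
    """Find every occurrence of a known KB name in the report."""
    found = []
    for name, stix_type in TOY_KB.items():
        idx = report.find(name)
        while idx != -1:
            found.append(Entity(name, stix_type, idx, idx + len(name)))
            idx = report.find(name, idx + 1)
    return found


def regex_extractor(report: str) -> List[Entity]:
    """Find IPv4-looking strings with a simple (non-validating) pattern."""
    pattern = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
    return [Entity(m.group(), "ipv4-addr", m.start(), m.end())
            for m in pattern.finditer(report)]


def merge(report: str, extractors: Iterable[Extractor]) -> List[Entity]:
    """Run every module and keep each (span, type) pair exactly once."""
    seen = set()
    merged = []
    for extract in extractors:
        for entity in extract(report):
            key = (entity.start, entity.end, entity.stix_type)
            if key not in seen:
                seen.add(key)
                merged.append(entity)
    return sorted(merged, key=lambda e: e.start)


def f1_score(predicted: Set, gold: Set) -> float:
    """Harmonic mean of precision and recall over two sets of items."""
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    report = "APT28 deployed X-Agent and contacted 10.0.0.1."
    for entity in merge(report, [kb_extractor, regex_extractor]):
        print(entity.stix_type, entity.text)
```

Under these assumptions, adding a new technique amounts to writing another function with the same signature and appending it to the list passed to `merge`; the deduplication step is what keeps overlapping modules from reporting the same entity twice.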