CTINEXUS: Leveraging Optimized LLM In-Context Learning for Constructing Cybersecurity Knowledge Graphs Under Data Scarcity

Textual descriptions in cyber threat intelligence (CTI) reports, such as security articles and news, are rich sources of knowledge about cyber threats and are crucial for organizations to stay informed about the rapidly evolving threat landscape. However, current CTI extraction methods lack flexibility and generalizability, often resulting in inaccurate and incomplete knowledge extraction. Syntax parsing relies on fixed rules and dictionaries, while model fine-tuning requires large annotated datasets, making both paradigms difficult to adapt to new threats and ontologies. To bridge this gap, we propose CTINexus, a novel framework that leverages optimized in-context learning (ICL) of large language models (LLMs) for data-efficient CTI knowledge extraction and high-quality cybersecurity knowledge graph (CSKG) construction. Unlike existing methods, CTINexus requires neither extensive data nor parameter tuning and can adapt to various ontologies with minimal annotated examples. This is achieved through (1) a carefully designed automatic prompt construction strategy with optimal demonstration retrieval for extracting a wide range of cybersecurity entities and relations; (2) a hierarchical entity alignment technique that canonicalizes the extracted knowledge and removes redundancy; and (3) an ICL-enhanced long-distance relation prediction technique that further completes the CSKG with missing links. Our extensive evaluations using 150 real-world CTI reports collected from 10 platforms demonstrate that CTINexus significantly outperforms existing methods in constructing accurate and complete CSKGs, highlighting its potential to transform CTI analysis with an efficient and adaptable solution for the dynamic threat landscape.
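
To make component (1) concrete, the following is a minimal sketch of similarity-based demonstration retrieval followed by automatic prompt construction. It is an illustration under stated assumptions, not the CTINexus implementation: the abstract does not specify the retriever, so a bag-of-words cosine similarity stands in for it, and the demonstration set, the triple format, the instruction wording, and the names `DEMOS`, `retrieve_demos`, and `build_prompt` are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical annotated demonstrations: (CTI sentence, gold triples).
DEMOS = [
    ("APT28 used spear-phishing emails to deliver Zebrocy malware.",
     [("APT28", "uses", "spear-phishing emails"),
      ("spear-phishing emails", "delivers", "Zebrocy")]),
    ("The ransomware encrypts files and demands payment in Bitcoin.",
     [("ransomware", "encrypts", "files"),
      ("ransomware", "demands", "Bitcoin payment")]),
    ("Attackers exploited CVE-2021-44228 in Log4j to gain remote code execution.",
     [("Attackers", "exploits", "CVE-2021-44228"),
      ("CVE-2021-44228", "affects", "Log4j")]),
]


def bow(text):
    """Lowercased bag-of-words vector (assumed stand-in for a real embedding)."""
    return Counter(re.findall(r"[a-z0-9\-]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve_demos(query, demos, k=2):
    """Return the k demonstrations most similar to the query, best first."""
    q = bow(query)
    ranked = sorted(demos, key=lambda d: cosine(q, bow(d[0])), reverse=True)
    return ranked[:k]


def build_prompt(query, demos, k=2):
    """Assemble an ICL prompt: instruction, k demonstrations, then the query."""
    parts = ["Extract (subject, relation, object) triples from the CTI text."]
    # Assumed ordering choice: the most similar demonstration is placed last,
    # directly before the query.
    for text, triples in reversed(retrieve_demos(query, demos, k)):
        parts.append(f"Text: {text}\nTriples: {triples}")
    parts.append(f"Text: {query}\nTriples:")
    return "\n\n".join(parts)


query = "The group exploited a Log4j vulnerability to deploy ransomware."
prompt = build_prompt(query, DEMOS, k=2)
print(prompt)
```

The resulting string would be sent to an LLM as-is; no model parameters are updated, which is what lets the approach adapt to a new ontology by swapping the demonstration set rather than retraining.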
