In the digital era, threat actors employ sophisticated techniques that often leave digital traces in the form of textual data. Cyber Threat Intelligence~(CTI) encompasses the solutions for data collection, processing, and analysis that help to understand a threat actor's targets and attack behavior. CTI is playing an increasingly crucial role in identifying and mitigating threats and in enabling proactive defense strategies. In this context, Natural Language Processing (NLP), a branch of artificial intelligence, has emerged as a powerful tool for enhancing threat intelligence capabilities. This survey provides a comprehensive overview of NLP-based techniques applied to threat intelligence. It begins by describing the foundational definitions and principles of CTI as a major tool for safeguarding digital assets. It then examines NLP-based techniques for CTI data crawling from Web sources, CTI data analysis, relation extraction from cybersecurity data, CTI sharing and collaboration, and security threats of CTI. Finally, it examines the challenges and limitations of NLP in threat intelligence, including data quality issues and ethical considerations. The survey outlines a complete framework and serves as a resource for security professionals and researchers seeking to understand state-of-the-art NLP-based threat intelligence techniques and their potential impact on cybersecurity.