Cyber-attack attribution is an important process that allows experts to put in place attacker-oriented countermeasures and pursue legal action. Given the complexity of the task, analysts still perform attribution mostly by hand. AI and, more specifically, Natural Language Processing (NLP) techniques can support cybersecurity analysts during the attribution process. However powerful these techniques are, they are held back by the lack of datasets in the attack attribution domain. In this work, we fill this gap by providing, to the best of our knowledge, the first dataset on cyber-attack attribution. We designed the dataset with the primary goal of extracting attack attribution information from cybersecurity texts using named entity recognition (NER) methods from NLP. Unlike other cybersecurity NER datasets, ours offers a rich set of annotations with contextual details, including some that span phrases and sentences. We conducted extensive experiments applying NLP techniques to demonstrate the dataset's effectiveness for attack attribution. These experiments highlight the potential of Large Language Models (LLMs) to improve NER on cybersecurity datasets for cyber-attack attribution.