Monitoring the threat landscape for actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed in natural language reports. Natural language processing can help manage this large volume of unstructured information, yet the topic has received little attention to date. In this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts, including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and to the MITRE ATT&CK knowledge base, the most widely used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out of context; our dataset annotates entire documents at a much finer granularity. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that concept descriptions from MITRE ATT&CK are an effective source of training data augmentation for identifying the MITRE ATT&CK concepts that a text mentions explicitly or implicitly.
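To make the augmentation idea from the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes each training example is a text paired with a MITRE ATT&CK technique ID, and it simply adds every knowledge-base concept description as one extra labeled example. The `Example` class, the `augment_with_descriptions` function, and the toy sentences are hypothetical names and data introduced only for illustration; the description strings are abbreviated stand-ins rather than exact knowledge-base text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One labeled training instance (hypothetical structure)."""
    text: str
    concept_id: str  # MITRE ATT&CK technique ID, e.g. "T1566"
    source: str      # "report" for annotated text, "kb_description" for augmented


def augment_with_descriptions(few_shot, kb_descriptions):
    """Return the few-shot examples plus one example per concept description.

    few_shot: list of Example taken from annotated reports.
    kb_descriptions: dict mapping a concept ID to its description text.
    """
    augmented = list(few_shot)
    seen = {(ex.text, ex.concept_id) for ex in few_shot}
    for concept_id, description in kb_descriptions.items():
        key = (description, concept_id)
        if key not in seen:  # avoid adding an exact duplicate
            augmented.append(Example(description, concept_id, "kb_description"))
            seen.add(key)
    return augmented


# Toy data (assumed for illustration, not taken from AnnoCTR).
few_shot = [
    Example("The actors sent emails with malicious attachments.", "T1566", "report"),
]
kb_descriptions = {
    "T1566": "Adversaries may send phishing messages to gain access to victim systems.",
    "T1059": "Adversaries may abuse command and script interpreters to execute commands.",
}

train = augment_with_descriptions(few_shot, kb_descriptions)
for ex in train:
    print(ex.concept_id, ex.source)
```

In this sketch, a concept with no annotated mention in the few-shot set (here `T1059`) still receives one training example, which is the intuition behind using concept descriptions when annotated data is scarce.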