AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports

Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts, including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out of context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.

