The detection of advanced persistent threats (APTs) remains a crucial challenge due to their stealthy, multistage nature and the limited availability of realistic, labeled datasets for systematic evaluation. Synthetic dataset generation has emerged as a practical approach for modeling APT campaigns; however, existing methods often rely on computationally expensive alert correlation mechanisms that limit scalability. Motivated by these limitations, this paper presents a near-realistic synthetic APT dataset and an efficient alert correlation framework. The proposed approach introduces a machine-learning-based correlation module that employs K-Nearest Neighbors (KNN) clustering with a cosine similarity metric to group semantically related alerts within a temporal context. The dataset emulates multistage APT campaigns across campus and organizational network environments and captures fourteen distinct alert types, exceeding the coverage of commonly used synthetic APT datasets. In addition, explicit APT campaign states and alert-to-stage mappings are defined to enable flexible integration of new alert types and to support stage-aware analysis. A comprehensive statistical characterization of the dataset is provided to facilitate reproducibility and support APT stage prediction.
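To make the correlation idea concrete, the following is a minimal sketch of grouping alerts by cosine similarity among nearest neighbors inside a time window. It is an illustration under stated assumptions, not the paper's implementation: the function name `correlate_alerts`, the feature vectors, and the parameters `k`, `window`, and `min_sim` are all hypothetical, and merging linked alerts into groups via connected components is an assumed design choice that the abstract does not specify.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def correlate_alerts(alerts, k=2, window=300.0, min_sim=0.8):
    """Group alerts given as (timestamp_seconds, feature_vector) tuples.

    Hypothetical parameters: k neighbors per alert, a time window in seconds,
    and a minimum cosine similarity for two alerts to be linked.
    Returns a sorted list of groups, each a list of alert indices.
    """
    parent = list(range(len(alerts)))  # union-find forest over alert indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[max(root_i, root_j)] = min(root_i, root_j)

    for i, (t_i, v_i) in enumerate(alerts):
        # Temporal context: only alerts within `window` seconds are candidates.
        candidates = [
            (cosine_similarity(v_i, v_j), j)
            for j, (t_j, v_j) in enumerate(alerts)
            if j != i and abs(t_i - t_j) <= window
        ]
        # Keep the k most similar candidates, then link those above the threshold.
        candidates.sort(key=lambda c: (-c[0], c[1]))
        for sim, j in candidates[:k]:
            if sim >= min_sim:
                union(i, j)

    groups = defaultdict(list)
    for i in range(len(alerts)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy input: made-up timestamps and feature vectors, not taken from the dataset.
example_alerts = [
    (0.0, [1.0, 0.0, 0.0]),
    (60.0, [0.9, 0.1, 0.0]),
    (120.0, [0.0, 1.0, 0.0]),
    (5000.0, [1.0, 0.0, 0.0]),
]
print(correlate_alerts(example_alerts))
```

In this toy input, the first two alerts are close in both time and direction, so they fall into one group; the third is within the window but nearly orthogonal, and the fourth matches the first in features but lies outside the window. Restricting neighbor search to a time window is also what keeps the pairwise comparison cost bounded in this sketch.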