S-DAPT-2026: A Stage-Aware Synthetic Dataset for Advanced Persistent Threat Detection

Withdrawal note (from arXiv): I am withdrawing this paper at my supervisors' request. We have identified significant errors in the analysis that affect the main conclusions of the paper, and we are withdrawing the manuscript to correct these issues.

The detection of advanced persistent threats (APTs) remains a crucial challenge due to their stealthy, multistage nature and the limited availability of realistic, labeled datasets for systematic evaluation. Synthetic dataset generation has emerged as a practical approach for modeling APT campaigns; however, existing methods often rely on computationally expensive alert correlation mechanisms that limit scalability. Motivated by these limitations, this paper presents a near-realistic synthetic APT dataset and an efficient alert correlation framework. The proposed approach introduces a machine-learning-based correlation module that employs K-Nearest Neighbors (KNN) clustering with a cosine similarity metric to group semantically related alerts within a temporal context. The dataset emulates multistage APT campaigns across campus and organizational network environments and captures fourteen distinct alert types, exceeding the coverage of commonly used synthetic APT datasets. In addition, explicit APT campaign states and alert-to-stage mappings are defined to enable flexible integration of new alert types and to support stage-aware analysis. A comprehensive statistical characterization of the dataset is provided to facilitate reproducibility and support APT stage prediction.
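The abstract does not specify how alerts are encoded, which value of K is used, how long the temporal window is, or how neighbor relations are turned into groups. As a minimal sketch of the general idea only (nearest-neighbor grouping by cosine similarity inside a time window), the following assumes each alert is a `(timestamp_seconds, feature_vector)` pair and links each alert to its most similar neighbors within the window; linked alerts are then merged into groups. The function name `group_alerts` and the parameters `k`, `window`, and `min_sim` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def group_alerts(alerts, k=2, window=300.0, min_sim=0.8):
    """Group alerts by linking each one to its k most similar neighbors in time.

    alerts:  list of (timestamp_seconds, feature_vector) pairs (assumed encoding).
    k:       number of nearest neighbors considered per alert (assumed value).
    window:  maximum time gap in seconds between linked alerts (assumed value).
    min_sim: minimum cosine similarity required to link two alerts (assumed value).
    Returns a sorted list of groups, each a list of alert indices.
    """
    parent = list(range(len(alerts)))  # union-find forest over alert indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, (t_i, v_i) in enumerate(alerts):
        # Candidate neighbors: every other alert inside the temporal window.
        candidates = [
            (cosine(v_i, v_j), j)
            for j, (t_j, v_j) in enumerate(alerts)
            if j != i and abs(t_j - t_i) <= window
        ]
        candidates.sort(reverse=True)  # most similar first
        for sim, j in candidates[:k]:
            if sim >= min_sim:
                parent[find(i)] = find(j)  # merge the two groups

    groups = defaultdict(list)
    for i in range(len(alerts)):
        groups[find(i)].append(i)
    return sorted(groups.values())


# Toy data: alerts 0 and 1 are close in time and direction; alert 2 is
# dissimilar; alert 3 matches alert 0 but falls outside the time window.
sample_alerts = [
    (0.0, [1.0, 0.0, 0.0]),
    (60.0, [0.9, 0.1, 0.0]),
    (120.0, [0.0, 1.0, 0.0]),
    (5000.0, [1.0, 0.0, 0.0]),
]
print(group_alerts(sample_alerts))
```

In this toy run, the first two alerts end up in one group, while the dissimilar alert and the out-of-window alert each remain alone. The time-window filter is what keeps identical but temporally distant alerts apart, which is one plausible reading of "within a temporal context".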

