Provenance graph analysis plays a vital role in intrusion detection, particularly against Advanced Persistent Threats (APTs), by exposing complex attack patterns. While recent systems combine graph neural networks (GNNs) with natural language processing (NLP) to capture structural and semantic features, their effectiveness is limited by class imbalance in real-world data. To address this, we introduce PROVSYN, a novel hybrid provenance graph synthesis framework, which comprises three components: (1) graph structure synthesis via heterogeneous graph generation models, (2) textual attribute synthesis via fine-tuned Large Language Models (LLMs), and (3) five-dimensional fidelity evaluation. Experiments on six benchmark datasets demonstrate that PROVSYN consistently produces higher-fidelity graphs across the five evaluation dimensions compared to four strong baselines. To further demonstrate the practical utility of PROVSYN, we utilize the synthesized graphs to augment training datasets for downstream APT detection models. The results show that PROVSYN effectively mitigates data imbalance, improving normalized entropy by up to 35%, and enhances the generalizability of downstream detection models, achieving an accuracy improvement of up to 38%.
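The abstract reports the imbalance result in terms of normalized entropy but does not define the metric. As a minimal sketch, assuming it means the Shannon entropy of the class-label distribution divided by the logarithm of the number of observed classes (so that 1.0 is perfectly balanced and values near 0 are highly skewed), it could be computed as follows. The function name, the label names, and the counts are hypothetical and are not taken from PROVSYN or its datasets.

```python
import math
from collections import Counter


def normalized_entropy(labels):
    """Shannon entropy of the label distribution, scaled to [0, 1].

    Assumption: normalization divides by log(k), where k is the number
    of distinct classes observed in `labels`.
    """
    counts = Counter(labels)
    k = len(counts)
    if k <= 1:
        # A single class (or no data) carries no uncertainty.
        return 0.0
    n = sum(counts.values())
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(k)


# Hypothetical, heavily imbalanced node-type labels.
original = ["process"] * 90 + ["file"] * 8 + ["socket"] * 2

# Hypothetical synthetic samples added for the minority classes.
augmented = original + ["file"] * 40 + ["socket"] * 45

print(f"original:  {normalized_entropy(original):.3f}")
print(f"augmented: {normalized_entropy(augmented):.3f}")
```

Under this assumed definition, adding synthetic samples to minority classes moves the score toward 1.0, which is the direction of improvement the abstract describes.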