Advanced Persistent Threats (APTs) have caused significant losses across a wide range of sectors, including the theft of sensitive data and damage to system integrity. As attack techniques grow increasingly sophisticated and stealthy, the arms race between cyber defenders and attackers continues to intensify. The revolutionary impact of Large Language Models (LLMs) has opened up numerous opportunities in various fields, including cybersecurity. An intriguing question arises: can the extensive knowledge embedded in LLMs be harnessed for provenance analysis and play a positive role in identifying previously unknown malicious events? To gain a deeper understanding of this question, we propose a new strategy for leveraging LLMs in provenance-based threat detection. In our design, a state-of-the-art LLM enriches the interpretation of provenance data, drawing on its knowledge of system calls and software identities, as well as its high-level understanding of application execution context. Its contextualized embedding capability is further exploited to capture the rich semantics of event descriptions. A comprehensive examination of the resulting embeddings shows that they are of promising quality, and machine learning models built upon them demonstrate outstanding performance on real-world data. In our evaluation, supervised threat detection achieves a precision of 99.0%, and semi-supervised anomaly detection attains a precision of 96.9%.
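To make the pipeline concrete, the following is a minimal sketch of the embed-then-detect idea described above. It is not the paper's implementation: the `embed` function is a deterministic hashed bag-of-words stand-in for a real LLM contextual-embedding endpoint, the sample event descriptions are invented for illustration, and the detector is a simple distance-to-benign-centroid anomaly score standing in for the semi-supervised model.

```python
import hashlib
import math

def embed(text, dim=64):
    """Stand-in for an LLM contextual-embedding endpoint (hypothetical):
    a deterministic hashed bag-of-words vector, L2-normalized."""
    vec = [0.0] * dim
    for tok in text.lower().split():
        h = int(hashlib.md5(tok.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a)) or 1.0
    nb = math.sqrt(sum(x * x for x in b)) or 1.0
    return dot / (na * nb)

# Illustrative benign provenance-event descriptions (not from the paper's dataset).
benign_events = [
    "sshd reads configuration file /etc/ssh/sshd_config",
    "nginx accepts tcp connection on port 443",
    "cron executes scheduled backup script",
]
benign_center = centroid([embed(t) for t in benign_events])

def anomaly_score(event_text):
    """Distance of an event's embedding from the benign centroid;
    higher means farther from known-benign behavior."""
    return 1.0 - cosine(embed(event_text), benign_center)
```

In a real deployment the hashed vectors would be replaced by the LLM's contextualized embeddings, which is where the semantic knowledge of system calls and software identity enters; the downstream detector (supervised classifier or semi-supervised anomaly model) then operates on those vectors unchanged.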