This paper investigates the use of a pre-trained language model and a Siamese network to identify sibling relationships between text-based cybersecurity vulnerability data. The ultimate aim is to support the construction of hierarchical attack models from a set of text descriptions characterising potential or observed vulnerabilities in a given system. Because of the nature of the data and the uncertainty-sensitive environment in which the problem arises, a practically oriented soft computing approach is necessary. A key focus of this work is therefore the reliability of the predicted links used to construct such models. To that end, the paper outlines conceptual and practical challenges associated with the proposed approach, such as dataset complexity and the stability of predictions, together with solutions to them. The contributions are, first, neural networks built on a pre-trained language model that predict sibling relationships between cybersecurity vulnerabilities and, second, an outline of how to apply this capability to the generation of hierarchical attack models. In addition, two data sampling mechanisms for tackling data complexity and a consensus mechanism for reducing the number of false positive predictions are outlined. These approaches are compared using empirical results from three sets of cybersecurity data to determine their effectiveness.