Representation Learning (RL) plays a pivotal role in the success of many problems, including cyberattack detection. Most RL methods for cyberattack detection are based on the latent vector of Auto-Encoder (AE) models. An AE transforms raw data into a new latent representation that better exposes the underlying characteristics of the input data, which makes it useful for identifying cyberattacks. However, due to the heterogeneity and sophistication of cyberattacks, the representation learned by AEs is often entangled or mixed, which makes the task of downstream attack detection models difficult. To tackle this problem, we propose a novel model called Twin Auto-Encoder (TAE). TAE deterministically transforms the latent representation into a more distinguishable representation, namely the \textit{separable representation}, and then reconstructs the separable representation at the output. The output of TAE, called the \textit{reconstruction representation}, is fed to downstream models to detect cyberattacks. We extensively evaluate the effectiveness of TAE on a wide range of benchmark datasets. Experimental results show that TAE achieves higher accuracy than state-of-the-art RL models and well-known machine learning algorithms. Moreover, TAE also outperforms state-of-the-art models on some sophisticated and challenging attacks. We then investigate various characteristics of TAE to further demonstrate its superiority.
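The data flow described in the abstract (input, latent representation, separable representation, reconstruction representation) can be illustrated with a minimal, untrained sketch. Everything concrete here is an assumption made for illustration only: the class name `TwinAutoEncoderSketch`, the single-layer encoder, the placeholder transform in `to_separable` (a fixed scaling), and the decoder-plus-second-encoder output stage are hypothetical and are not taken from the abstract, which does not specify the transform, the layers, or the training losses.

```python
import math
import random


def linear(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def random_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random weights (untrained)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


class TwinAutoEncoderSketch:
    """Hypothetical, untrained illustration of the stages named in the abstract."""

    def __init__(self, input_dim, latent_dim, seed=0):
        rng = random.Random(seed)
        self.latent_dim = latent_dim
        # Encoder: input space -> latent space.
        self.w_enc = random_matrix(latent_dim, input_dim, rng)
        # Assumed output stage: decode to input space, then re-encode.
        self.w_dec = random_matrix(input_dim, latent_dim, rng)
        self.w_out = random_matrix(latent_dim, input_dim, rng)

    def encode(self, x):
        """Latent representation of the raw input."""
        return [math.tanh(v) for v in linear(self.w_enc, x)]

    def to_separable(self, latent):
        """Deterministic transform to the separable representation.

        Placeholder only: a fixed scaling stands in for the real
        transform, which the abstract does not define.
        """
        return [2.0 * v for v in latent]

    def reconstruct(self, separable):
        """Reconstruct the separable representation at the output (assumed design)."""
        x_hat = linear(self.w_dec, separable)
        return linear(self.w_out, x_hat)

    def forward(self, x):
        """Return the reconstruction representation passed to a downstream detector."""
        return self.reconstruct(self.to_separable(self.encode(x)))


model = TwinAutoEncoderSketch(input_dim=6, latent_dim=3)
sample = [0.1, -0.4, 0.7, 0.0, 0.3, -0.2]
reconstruction_representation = model.forward(sample)
```

In this sketch the vector returned by `forward` plays the role of the reconstruction representation, so a downstream classifier would be trained on it rather than on the raw input. With trained weights the output stage would be fitted to reproduce the separable representation; here the weights are random, so only the shapes and the order of the stages are meaningful.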