Network traffic refers to the volume of information sent and received over the internet or any other system that connects computers. Analyzing and understanding network traffic is vital for improving network security and management. However, traffic analysis is challenging because of the unique characteristics of data packets, such as heterogeneous headers and encrypted payloads that lack semantics. To capture the latent semantics of traffic, a few studies have adopted pre-training techniques based on the Transformer encoder or decoder to learn representations from large-scale traffic data. These methods, however, typically excel at only one of traffic understanding (classification) or traffic generation, not both. To address this issue, we develop Lens, a foundational network traffic model that leverages the T5 architecture to learn pre-trained representations from large-scale unlabeled data. Harnessing the strength of the encoder-decoder framework, which captures global information while preserving generative ability, our model can better learn representations from large-scale network traffic. To further enhance pre-training performance, we design a novel loss that integrates three distinct tasks: Masked Span Prediction (MSP), Packet Order Prediction (POP), and Homologous Traffic Prediction (HTP). Evaluation results on multiple benchmark datasets demonstrate that Lens outperforms the baselines in most downstream tasks related to both traffic understanding and traffic generation. Notably, it also requires considerably less labeled data for fine-tuning than current methods.
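As an illustration of the Masked Span Prediction idea, the following is a minimal sketch of generic T5-style span corruption applied to packet bytes. It is not the Lens implementation: the function name `mask_spans`, the choice of one hex byte per token, the fixed span length, and the masking ratio are all assumptions made for this example. Only the `<extra_id_N>` sentinel naming follows the usual T5 convention.

```python
import random


def mask_spans(tokens, mask_ratio=0.15, span_len=3, seed=0):
    """Generic T5-style span corruption (illustrative, not the Lens code).

    Contiguous runs of masked tokens are replaced by one sentinel in the
    encoder input; the decoder target lists each sentinel followed by the
    tokens it hides, and ends with one closing sentinel.
    """
    if not tokens:
        return [], []
    rng = random.Random(seed)
    n = len(tokens)
    budget = min(n, max(1, round(n * mask_ratio)))  # tokens to hide
    masked = [False] * n
    while sum(masked) < budget:
        # Never mark more new positions than the remaining budget allows.
        length = min(span_len, budget - sum(masked))
        start = rng.randrange(n - length + 1)
        for i in range(start, start + length):
            masked[i] = True

    encoder_input, decoder_target = [], []
    sentinel = 0
    prev_masked = False
    for tok, is_masked in zip(tokens, masked):
        if is_masked:
            if not prev_masked:  # a new masked run starts here
                marker = f"<extra_id_{sentinel}>"
                encoder_input.append(marker)
                decoder_target.append(marker)
                sentinel += 1
            decoder_target.append(tok)
        else:
            encoder_input.append(tok)
        prev_masked = is_masked
    decoder_target.append(f"<extra_id_{sentinel}>")  # closing sentinel
    return encoder_input, decoder_target


# Hypothetical input: the first bytes of an IPv4 header, one hex byte per token.
packet = "45 00 00 3c 1c 46 40 00 40 06 b1 e6 c0 a8 00 68".split()
enc, dec = mask_spans(packet, mask_ratio=0.25, span_len=2, seed=1)
print(" ".join(enc))
print(" ".join(dec))
```

In this setup the encoder sees the corrupted byte sequence and the decoder is trained to emit the hidden spans, so the same objective exercises both the understanding and the generation side of an encoder-decoder model.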