All data on the Internet is carried as network traffic, so accurately modeling network traffic can help improve the quality of network services and protect data privacy. Pretrained models for network traffic can use large-scale raw data to learn the essential characteristics of traffic and produce distinguishable outputs for input traffic without reference to any specific downstream task. An effective pretrained model can significantly improve both the training efficiency and the performance of downstream tasks such as application classification, attack detection, and traffic generation. Despite the great success of pretraining in natural language processing, no such work exists in the networking field. Given the diverse demands and characteristics of network traffic and network tasks, building a pretrained model for network traffic is non-trivial. We face various challenges, in particular the heterogeneous headers and payloads of multi-pattern network traffic and the differing context dependencies of diverse downstream network tasks. To tackle these challenges, we make the first attempt at a generative pretrained model, NetGPT, for both traffic understanding and traffic generation tasks. We propose multi-pattern network traffic modeling to construct unified text inputs that support both kinds of task. We further improve the adaptation of the pretrained model to diverse tasks by shuffling header fields, segmenting packets in flows, and incorporating diverse task labels with prompts. Using diverse traffic datasets covering encrypted software, DNS, private industrial protocols, and cryptocurrency mining, extensive experiments demonstrate the effectiveness of NetGPT across a range of traffic understanding and generation tasks, where it outperforms state-of-the-art baselines by a wide margin.
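The abstract names three adaptation steps (shuffling header fields, segmenting packets in flows, and attaching task labels through prompts) on top of a unified text input. The sketch below illustrates one way such an input could be assembled. It is not NetGPT's actual encoding: the one-hex-token-per-byte scheme, the `<pkt>`, `<payload>`, and `<task:...>` marker tokens, the field names, and the function names are all assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

# A packet is modeled here as (header fields, payload bytes).
# This representation is an assumption, not taken from the paper.
Packet = Tuple[Dict[str, bytes], bytes]


def bytes_to_hex_tokens(data: bytes) -> List[str]:
    """Encode raw bytes as one two-character hex token per byte."""
    return [f"{b:02x}" for b in data]


def encode_packet(packet: Packet, rng: random.Random, shuffle: bool) -> List[str]:
    """Turn one packet into text tokens, optionally shuffling header field order."""
    header_fields, payload = packet
    names = list(header_fields)
    if shuffle:
        # Header-field shuffling: randomize the order in which fields appear.
        rng.shuffle(names)
    tokens: List[str] = []
    for name in names:
        tokens.append(f"<{name}>")  # hypothetical field marker
        tokens.extend(bytes_to_hex_tokens(header_fields[name]))
    tokens.append("<payload>")  # hypothetical payload marker
    tokens.extend(bytes_to_hex_tokens(payload))
    return tokens


def encode_flow(packets: List[Packet], task: str,
                rng: random.Random, shuffle: bool = True) -> str:
    """Build one text input: task prompt, then packets separated by <pkt>."""
    tokens = [f"<task:{task}>"]  # hypothetical task-label prompt
    for packet in packets:
        tokens.append("<pkt>")  # packet segmentation marker within the flow
        tokens.extend(encode_packet(packet, rng, shuffle))
    return " ".join(tokens)


example_flow: List[Packet] = [
    ({"src_port": b"\x01\xbb", "dst_port": b"\xc0\x01"}, b"\x16\x03"),
]
text = encode_flow(example_flow, "app_classification",
                   random.Random(0), shuffle=False)
print(text)
# <task:app_classification> <pkt> <src_port> 01 bb <dst_port> c0 01 <payload> 16 03
```

With `shuffle=True` the same tokens appear, but the header fields within each packet may be emitted in a different order; passing a seeded `random.Random` keeps the result reproducible.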