Traffic classification plays an important role in maintaining network Quality of Service (QoS). Because traditional methods rely heavily on feature extraction and large-scale labeled data, recent pre-trained models reduce this dependency by using various pre-training tasks to learn generic representations of network packets. However, existing pre-trained models typically adopt pre-training tasks developed for image or text data, which are not tailored to traffic data. As a result, the learned traffic representations fail to fully reflect the information contained in the traffic and may even disrupt protocol information. To address this, we propose TraGe, a novel generic packet representation model for traffic classification. Based on the differences between the header and the payload, the two fundamental components of a network packet, we perform differentiated pre-training according to their byte sequence variations (continuous in the header vs. discontinuous in the payload). A dynamic masking strategy is further introduced to prevent overfitting to fixed byte positions. Once the generic packet representation is obtained, TraGe can be fine-tuned for diverse traffic classification tasks using limited labeled data. Experimental results demonstrate that TraGe significantly outperforms state-of-the-art methods on two traffic classification tasks, with up to a 6.97% performance improvement. Moreover, TraGe exhibits superior robustness under parameter fluctuations and variations in sampling configurations.
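The abstract does not spell out the masking procedure, so the following is only a minimal sketch of one way differentiated, dynamic masking could look. It assumes that header bytes are masked in contiguous spans, that payload bytes are masked at independently sampled positions, and that "dynamic" means the mask is re-sampled on every call rather than fixed per packet. The function name `mask_packet`, the `MASK` token id, and the `ratio` and `span` parameters are hypothetical and do not come from the paper.

```python
import random

# Hypothetical mask token id, chosen outside the byte range 0..255.
MASK = 256


def mask_packet(packet, header_len, ratio=0.15, span=4, rng=None):
    """Return (masked_tokens, masked_positions) for one packet.

    Assumption: the header is masked in contiguous spans and the payload
    at scattered single-byte positions. A fresh mask is drawn on every
    call, so the same packet is masked differently across epochs.
    """
    rng = rng or random.Random()
    tokens = list(packet)
    header_len = min(header_len, len(tokens))
    payload_len = len(tokens) - header_len

    # Header: add contiguous spans until the target count is reached.
    header_target = int(header_len * ratio)
    header_pos = set()
    while len(header_pos) < header_target:
        start = rng.randrange(header_len)
        # Clip the span so it never crosses into the payload.
        header_pos.update(range(start, min(start + span, header_len)))

    # Payload: sample individual byte positions without replacement.
    payload_target = int(payload_len * ratio)
    payload_pos = rng.sample(range(header_len, len(tokens)), payload_target)

    positions = sorted(header_pos | set(payload_pos))
    for i in positions:
        tokens[i] = MASK
    return tokens, positions


if __name__ == "__main__":
    pkt = bytes(range(60))  # toy packet: 20 header bytes + 40 payload bytes
    for epoch in range(2):
        _, pos = mask_packet(pkt, header_len=20, ratio=0.25)
        print(f"epoch {epoch}: masked positions {pos}")
```

Calling `mask_packet` once per training step, rather than precomputing a mask for each packet, is what keeps the model from seeing the same byte positions hidden every time.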