Network traffic classification is vital for network security and management. Pre-training has shown promise by learning general traffic representations from raw byte sequences, thereby reducing reliance on labeled data. However, existing pre-trained models struggle with the gap between traffic heterogeneity (i.e., hierarchical traffic structures) and input homogeneity (i.e., flattened byte sequences). To bridge this gap, we propose Nethira, a heterogeneity-aware pre-trained model based on hierarchical reconstruction and augmentation. During pre-training, Nethira performs hierarchical reconstruction at the byte, protocol, and packet levels, capturing comprehensive structural information about traffic. During fine-tuning, Nethira employs a consistency-regularized strategy with hierarchical traffic augmentation to further reduce label dependence. Experiments on four public datasets demonstrate that Nethira outperforms seven existing pre-trained models, achieving an average F1-score improvement of 9.11% and reaching comparable performance with only 1% of the labeled data on high-heterogeneity network tasks.