Mobile Internet has profoundly reshaped modern lifestyles. Encrypted Traffic Classification (ETC) plays a crucial role in managing the mobile Internet, especially given the explosive growth of mobile apps that use encrypted communication. Although some existing learning-based ETC methods show promising results, three limitations remain in real-world network environments: 1) label bias caused by traffic class imbalance, 2) traffic homogeneity caused by component sharing, and 3) reliance on sufficient labeled traffic for training. No existing ETC method addresses all three. In this paper, we propose a novel Pre-trAining Semi-Supervised ETC framework, dubbed PASS. Our key insight is to resample the original training dataset and perform contrastive pre-training without directly using individual app labels. This avoids the label bias caused by class imbalance, while pulling positive traffic pairs closer and pushing negative pairs apart yields a robust feature representation that can differentiate overlapping homogeneous traffic. PASS also adopts a semi-supervised optimization strategy based on pseudo-label iteration and dynamic loss weighting, so that it can exploit massive unlabeled traffic data and reduce the manual effort of annotating the training dataset. PASS outperforms state-of-the-art ETC methods and generic sampling approaches on four public datasets with significant class imbalance and traffic homogeneity, improving F1 by 1.31% on Cross-Platform215 and by 9.12% on ISCX-17. Furthermore, we validate the generality of the contrastive pre-training and pseudo-label iteration components of PASS, which can adaptively benefit ETC methods with diverse feature extractors.
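To make the idea of "pulling positive traffic pairs closer and pushing negative pairs away" concrete, here is a minimal sketch of a contrastive objective. It assumes an InfoNCE-style loss over cosine similarities; the abstract does not state which contrastive loss PASS uses, so the function names, the temperature value, and the toy embeddings below are all hypothetical and serve only as an illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor (an assumed loss, not taken from the paper).

    The loss is small when the anchor is more similar to its positive
    than to every negative, and large otherwise.
    """
    pos_logit = cosine(anchor, positive) / temperature
    neg_logits = [cosine(anchor, n) / temperature for n in negatives]
    logits = [pos_logit] + neg_logits
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


# Toy, made-up flow embeddings (hypothetical values for illustration only).
anchor = [1.0, 0.0]
similar_flow = [0.9, 0.1]
different_flow = [0.0, 1.0]

# Positive is the similar flow: the representation already separates the pair.
well_separated = info_nce_loss(anchor, similar_flow, [different_flow])
# Positive is the dissimilar flow: the representation would need to change.
poorly_separated = info_nce_loss(anchor, different_flow, [similar_flow])

print(well_separated < poorly_separated)
```

Minimizing such a loss does not require per-app class labels, only a rule for deciding which pairs count as positive and which as negative, which is in line with the abstract's point about avoiding direct use of individual app labels during pre-training.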