Mobile Internet has profoundly reshaped many aspects of modern life. Encrypted Traffic Classification (ETC) plays a crucial role in managing mobile Internet, especially given the explosive growth of mobile apps that use encrypted communication. Although some existing learning-based ETC methods show promising results, three limitations remain in real-world network environments: 1) label bias caused by traffic class imbalance, 2) traffic homogeneity caused by component sharing, and 3) reliance on sufficient labeled traffic for training. No existing ETC method addresses all three. In this paper, we propose a novel Pre-trAining Semi-Supervised ETC framework, dubbed PASS. Our key insight is to resample the original training dataset and perform contrastive pre-training without directly using individual app labels. This avoids the label bias caused by class imbalance, and it yields a robust feature representation that differentiates overlapping homogeneous traffic by pulling positive traffic pairs closer and pushing negative pairs apart. In addition, PASS adopts a semi-supervised optimization strategy based on pseudo-label iteration and dynamic loss weighting, which makes effective use of massive unlabeled traffic data and reduces the manual workload of annotating the training dataset. PASS outperforms state-of-the-art ETC methods and generic sampling approaches on four public datasets with significant class imbalance and traffic homogeneity, improving F1 by 1.31% on Cross-Platform215 and by 9.12% on ISCX-17. Furthermore, we validate the generality of the contrastive pre-training and pseudo-label iteration components of PASS, which can adaptively benefit ETC methods with diverse feature extractors.
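The abstract names three mechanisms (contrastive pre-training over positive and negative traffic pairs, pseudo-label iteration, and dynamic loss weighting) without giving their exact form. The following is a minimal illustrative sketch, not the PASS implementation: it assumes an InfoNCE-style loss over cosine similarity, a fixed confidence threshold for accepting pseudo-labels, and a linear ramp-up for the weight on the unlabeled loss. All function names, the temperature, the threshold, and the schedule are hypothetical choices made only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for one anchor (assumed form, not taken from the paper).

    The loss shrinks as the anchor moves closer to its positive
    and farther from its negatives.
    """
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))


def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, class_index) for confident unlabeled samples.

    The fixed threshold is an assumption; the abstract does not state
    how pseudo-labels are accepted.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


def unlabeled_weight(iteration, total_iterations, max_weight=1.0):
    """Linear ramp-up of the unlabeled-loss weight (assumed schedule)."""
    return max_weight * min(1.0, iteration / total_iterations)


def combined_loss(supervised, unsupervised, iteration, total_iterations):
    """Supervised loss plus a dynamically weighted pseudo-label loss."""
    return supervised + unlabeled_weight(iteration, total_iterations) * unsupervised
```

Under these assumptions, a pre-training step would minimize `info_nce` over sampled pairs, and each later iteration would call `select_pseudo_labels` on the model's predictions for unlabeled flows, then train on `combined_loss` so that pseudo-labeled traffic contributes more as training progresses.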