The mobile Internet has profoundly reshaped modern lifestyles. Encrypted Traffic Classification (ETC) plays a crucial role in managing the mobile Internet, especially given the explosive growth of mobile apps that use encrypted communication. Although existing learning-based ETC methods show promising results, three limitations remain in real-world network environments: 1) label bias caused by traffic class imbalance, 2) traffic homogeneity caused by component sharing, and 3) reliance on sufficient labeled traffic for training. No existing ETC method addresses all three. In this paper, we propose a novel Pre-trAining Semi-Supervised ETC framework, dubbed PASS. Our key insight is to resample the original training dataset and perform contrastive pre-training without directly using individual app labels, which avoids the label bias caused by class imbalance; pulling positive traffic pairs closer and pushing negative pairs apart also yields a robust feature representation that can differentiate overlapping homogeneous traffic. In addition, PASS adopts a semi-supervised optimization strategy based on pseudo-label iteration and dynamic loss weighting, which makes effective use of massive unlabeled traffic data and reduces the manual workload of annotating the training dataset. PASS outperforms state-of-the-art ETC methods and generic sampling approaches on four public datasets with significant class imbalance and traffic homogeneity, improving F1 by 1.31% on Cross-Platform215 and by 9.12% on ISCX-17. Furthermore, we validate the generality of the contrastive pre-training and pseudo-label iteration components of PASS, which can adaptively benefit ETC methods with diverse feature extractors.
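To make the two ingredients named above more concrete, the following is a minimal sketch rather than the PASS implementation. The abstract does not specify the exact loss or schedule, so everything here is an assumption made for illustration: an InfoNCE-style contrastive loss over cosine similarities, a fixed confidence threshold for accepting pseudo-labels, and a linear ramp-up as the dynamic weight on the pseudo-label loss. The function names, the temperature of 0.5, and the threshold of 0.9 are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.5):
    """InfoNCE-style loss for a single anchor (assumed form, not from the paper).

    The loss is small when the positive is the most similar candidate and
    grows when some negative is closer to the anchor than the positive.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max before exponentiating, for stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]


def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (sample_index, predicted_class) for confident predictions only.

    `threshold` is a hypothetical confidence cut-off; samples below it stay
    unlabeled and can be reconsidered in a later iteration.
    """
    selected = []
    for i, probs in enumerate(probabilities):
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:
            selected.append((i, best))
    return selected


def unlabeled_weight(iteration, ramp_iterations=10, max_weight=1.0):
    """Linear ramp-up of the pseudo-label loss weight (an assumed schedule)."""
    return max_weight * min(1.0, iteration / ramp_iterations)
```

Under these assumptions, a training loop would first minimize `info_nce` over resampled traffic pairs, then alternate between fitting a classifier and calling `select_pseudo_labels` on unlabeled flows, combining the two losses as `labeled_loss + unlabeled_weight(t) * pseudo_loss` at iteration `t`.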