Replication: Contrastive Learning and Data Augmentation in Traffic Classification Using a Flowpic Input Representation

Over the last years we have witnessed a renewed interest in Traffic Classification (TC), driven by the rise of Deep Learning (DL). Yet the vast majority of TC literature lacks code artifacts, performance assessments across datasets, and reference comparisons against Machine Learning (ML) methods. Among those works, a recent study from IMC22 [16] is worthy of attention since it adopts recent DL methodologies (namely, few-shot learning, self-supervision via contrastive learning, and data augmentation) that are appealing for networking because they enable learning from a few samples and transferring across datasets. The main result of [16] on the UCDAVIS19, ISCX-VPN and ISCX-Tor datasets is that, with such DL methodologies, 100 input samples are enough to achieve very high accuracy using an input representation called "flowpic" (i.e., a per-flow 2D histogram of the packet size evolution over time). In this paper (i) we reproduce [16] on the same datasets and (ii) we replicate its most salient aspect (the importance of data augmentation) on three additional public datasets (MIRAGE19, MIRAGE22 and UTMOBILENET21). While we confirm most of the original results, we also found a 20% accuracy drop in some of the investigated scenarios, due to a data shift in the original dataset that we uncovered. Additionally, our study validates that the data augmentation strategies studied in [16] perform well on other datasets too. In the spirit of reproducibility and replicability, we make all artifacts (code and data) available to the research community at https://tcbenchstack.github.io/tcbench/
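To make the flowpic representation concrete, the following is a minimal sketch of a per-flow 2D histogram of packet size over time. It is an illustration only, not the implementation used in [16] or in our artifacts: the function name `flowpic`, the 32x32 resolution, the 15-second window, and the 1500-byte maximum packet size are all assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple


def flowpic(
    packets: Sequence[Tuple[float, int]],
    resolution: int = 32,     # assumed grid size (hypothetical default)
    window_s: float = 15.0,   # assumed time window in seconds (hypothetical default)
    max_size: int = 1500,     # assumed maximum packet size in bytes
) -> List[List[int]]:
    """Return a resolution x resolution grid of packet counts.

    Each packet is a (timestamp, size) pair, with the timestamp in seconds
    relative to the first packet of the flow. Rows index packet-size bins,
    columns index time bins.
    """
    grid = [[0] * resolution for _ in range(resolution)]
    for ts, size in packets:
        # Ignore packets outside the time window or with invalid sizes.
        if not (0.0 <= ts < window_s) or size < 0:
            continue
        col = min(int(ts / window_s * resolution), resolution - 1)
        # Packets at or above max_size fall into the last size bin.
        row = min(int(size / max_size * resolution), resolution - 1)
        grid[row][col] += 1
    return grid


# Toy flow: three packets inside a 16-second window, one outside it.
toy_flow = [(0.0, 100), (0.1, 1500), (14.9, 750), (20.0, 60)]
toy_pic = flowpic(toy_flow, resolution=4, window_s=16.0)
```

In this sketch each cell counts the packets that fall in a given (size bin, time bin) pair, so the resulting grid can be fed to an image-style model. Packets outside the window are simply dropped.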
