Contrastive Learning and Data Augmentation in Traffic Classification Using a Flowpic Input Representation

In recent years we have witnessed a renewed interest in Traffic Classification (TC), driven by the rise of Deep Learning (DL). Yet the vast majority of the TC literature lacks code artifacts, performance assessments across datasets, and reference comparisons against Machine Learning (ML) methods. Among those works, a recent study from IMC'22 [17] is worthy of attention since it adopts recent DL methodologies (namely few-shot learning, self-supervision via contrastive learning, and data augmentation) that are appealing for networking because they make it possible to learn from a few samples and to transfer across datasets. The main result of [17] on the UCDAVIS19, ISCX-VPN and ISCX-Tor datasets is that, with such DL methodologies, 100 input samples are enough to achieve very high accuracy using an input representation called "flowpic" (i.e., a per-flow 2D histogram of the evolution of packet sizes over time). In this paper (i) we reproduce [17] on the same datasets and (ii) we replicate its most salient aspect (the importance of data augmentation) on three additional public datasets: MIRAGE-19, MIRAGE-22 and UTMOBILENET21. While we confirm most of the original results, we also find a 20% accuracy drop in some of the investigated scenarios, caused by a data shift in the original dataset that we uncovered. Additionally, our study validates that the data augmentation strategies studied in [17] perform well on other datasets too. In the spirit of reproducibility and replicability, we make all artifacts (code and data) available at [10].
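To make the flowpic representation concrete, the following is a minimal sketch of how a per-flow 2D histogram of packet sizes over time could be computed with NumPy. The function name `flowpic`, the 32x32 resolution, the 15-second window, and the 1500-byte maximum packet size are assumptions chosen for illustration; they are not taken from the code released at [10].

```python
import numpy as np


def flowpic(timestamps, sizes, resolution=32, window_s=15.0, max_size=1500):
    """Build a flowpic-style 2D histogram for a single flow.

    Rows index time bins and columns index packet-size bins.
    All default parameter values are illustrative assumptions.
    """
    ts = np.asarray(timestamps, dtype=float)
    sz = np.asarray(sizes, dtype=float)
    if ts.shape != sz.shape:
        raise ValueError("timestamps and sizes must have the same length")
    if ts.size == 0:
        return np.zeros((resolution, resolution))

    # Express packet times relative to the first packet of the flow.
    ts = ts - ts[0]

    # Keep only the packets that fall inside the observation window.
    keep = (ts >= 0) & (ts < window_s)
    ts = ts[keep]
    # Clip sizes so that oversized packets land in the last size bin.
    sz = np.clip(sz[keep], 0, max_size)

    # Count packets per (time bin, size bin) cell.
    hist, _, _ = np.histogram2d(
        ts,
        sz,
        bins=resolution,
        range=[[0.0, window_s], [0.0, max_size]],
    )
    return hist


if __name__ == "__main__":
    # Hypothetical flow: five packets, the last one outside the window.
    pic = flowpic([0.0, 0.1, 0.2, 14.9, 20.0], [100, 1500, 100, 60, 500])
    print(pic.shape, int(pic.sum()))
```

Each cell of the resulting matrix counts the packets observed in one (time interval, size interval) pair, so the matrix can be fed to an image-style model. Augmentations of the kind discussed in [17] would then operate either on the packet series before binning or on the matrix itself.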
