Data Augmentation (DA), i.e., enriching training data by adding synthetic samples, is a technique widely adopted in Computer Vision (CV) and Natural Language Processing (NLP) tasks to improve model performance. Yet, DA has struggled to gain traction in networking contexts, particularly in Traffic Classification (TC) tasks. In this work, we fill this gap by benchmarking 18 augmentation functions applied to 3 TC datasets, using packet time series as the input representation and considering a variety of training conditions. Our results show that (i) DA can yield previously unexplored benefits, (ii) augmentations acting on time series sequence order and masking are better suited to TC, and (iii) simple latent space analysis can provide hints about why augmentations have positive or negative effects.
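To make the two augmentation families highlighted in (ii) concrete, the following is a minimal sketch of one masking and one sequence-order augmentation applied to a packet time series. It is an illustration under stated assumptions, not one of the 18 benchmarked functions: the names `random_mask` and `swap_adjacent`, the parameters `mask_ratio` and `n_swaps`, and the layout of a flow as a `(packets, features)` array are all hypothetical.

```python
import numpy as np


def random_mask(series, mask_ratio=0.2, rng=None):
    """Masking sketch: zero out a random subset of packets (rows).

    `series` is assumed to have shape (n_packets, n_features).
    """
    rng = np.random.default_rng() if rng is None else rng
    out = series.copy()
    n_steps = out.shape[0]
    n_mask = int(round(mask_ratio * n_steps))
    # Pick distinct packet positions to mask.
    idx = rng.choice(n_steps, size=n_mask, replace=False)
    out[idx] = 0
    return out


def swap_adjacent(series, n_swaps=1, rng=None):
    """Sequence-order sketch: swap randomly chosen pairs of adjacent packets."""
    rng = np.random.default_rng() if rng is None else rng
    out = series.copy()
    n_steps = out.shape[0]
    if n_steps < 2:
        return out
    for _ in range(n_swaps):
        i = rng.integers(0, n_steps - 1)  # i in [0, n_steps - 2]
        # Fancy indexing on the right-hand side copies, so this is a safe swap.
        out[[i, i + 1]] = out[[i + 1, i]]
    return out


if __name__ == "__main__":
    # Hypothetical flow: 10 packets x 3 features
    # (e.g. size, direction, inter-arrival time).
    flow = np.arange(1, 31, dtype=float).reshape(10, 3)
    rng = np.random.default_rng(0)
    print(random_mask(flow, mask_ratio=0.2, rng=rng))
    print(swap_adjacent(flow, n_swaps=1, rng=rng))
```

Both functions return a modified copy and leave the input untouched, so the original flow and its augmented variants can be used side by side during training.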