The popularity of Deep Learning (DL), coupled with the reduced visibility into network traffic caused by the growing adoption of HTTPS, QUIC and DNSSEC, has re-ignited interest in Traffic Classification (TC). However, to reduce the dependence on large, task-specific labeled datasets, we need better ways to learn representations that remain valid across tasks. In this work we investigate this problem by comparing transfer learning, meta-learning and contrastive learning against reference tree-based Machine Learning (ML) models and monolithic DL models (16 methods in total). Using two publicly available datasets, namely MIRAGE19 (40 classes) and AppClassNet (500 classes), we show that (i) larger datasets yield more general representations, (ii) contrastive learning is the best methodology, (iii) meta-learning is the worst, and (iv) while tree-based ML models cannot handle large tasks but fit small tasks well, DL methods that reuse learned representations are approaching the performance of tree-based models on small tasks as well.