Datasets of labeled network traces are essential for many machine learning (ML) tasks in networking, yet their availability is hindered by privacy and maintenance concerns, such as data staleness. To overcome this limitation, synthetic network traces can often augment existing datasets. Unfortunately, current synthetic trace generation methods, which typically produce only aggregated flow statistics or a few selected packet attributes, do not always suffice, especially when model training relies on features that are only available from packet traces. This shortfall manifests in both insufficient statistical resemblance to real traces and suboptimal performance on ML tasks when the synthetic traces are used for data augmentation. In this paper, we apply diffusion models to generate high-resolution synthetic network traffic traces. We present NetDiffusion, a tool that uses a fine-tuned, controlled variant of a Stable Diffusion model to generate synthetic network traffic that is high-fidelity and conforms to protocol specifications. Our evaluation demonstrates that packet captures generated by NetDiffusion can achieve higher statistical similarity to real data and better ML model performance than current state-of-the-art approaches (e.g., GAN-based approaches). Furthermore, our synthetic traces are compatible with common network analysis tools and support a wide range of network tasks, suggesting that NetDiffusion can serve a broader spectrum of network analysis and testing tasks, extending beyond ML-centric applications.