Synthetic data generation is an appealing tool for augmenting and enriching datasets, and it plays a crucial role in advancing artificial intelligence (AI) and machine learning (ML). Synthetic data helps build robust AI/ML datasets cost-effectively, offers privacy-friendly solutions, and avoids the complexity of storing large data volumes. This paper proposes a novel method, based on first-order auto-regressive noise statistics, to generate synthetic data for large-scale Wi-Fi deployments. The approach requires minimal real data while producing statistically rich traffic patterns that effectively mimic real Access Point (AP) behavior. Experimental results show that ML models trained on synthetic data achieve Mean Absolute Error (MAE) values within 10 to 15 percent of those obtained using real data when trained on the same APs, while requiring significantly less training data. Moreover, when generalization is required, synthetic-data-trained models improve prediction accuracy by up to 50 percent compared to real-data-trained baselines, thanks to the enhanced variability and diversity of the generated traces. Overall, the proposed method bridges the gap between synthetic data generation and practical Wi-Fi traffic forecasting, providing a scalable, efficient, and real-time solution for modern wireless networks.
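The abstract does not spell out the generation procedure, so the following is only a minimal sketch of what "first-order auto-regressive noise statistics" could look like in practice, not the authors' implementation. It assumes that a real trace can be split into a baseline load profile plus a residual, that the residual is modeled as a stationary AR(1) process x[t] = phi * x[t-1] + eps[t], and that synthetic traces are obtained by adding freshly drawn AR(1) noise to the baseline. The function names, the sinusoidal baseline, and the clipping at zero are all hypothetical choices made for illustration.

```python
import math
import random


def estimate_ar1(residuals):
    """Estimate the lag-1 coefficient and innovation std of a residual series."""
    n = len(residuals)
    mean = sum(residuals) / n
    centered = [r - mean for r in residuals]
    var = sum(c * c for c in centered) / n
    if var == 0.0:
        return 0.0, 0.0
    cov1 = sum(centered[t] * centered[t - 1] for t in range(1, n)) / n
    phi = cov1 / var
    # For a stationary AR(1) process: var = sigma_eps^2 / (1 - phi^2).
    sigma_eps = math.sqrt(var * (1.0 - phi * phi))
    return phi, sigma_eps


def generate_ar1_noise(n, phi, sigma_eps, rng):
    """Draw n samples of x[t] = phi * x[t-1] + eps[t], eps ~ N(0, sigma_eps^2)."""
    if abs(phi) >= 1.0:
        raise ValueError("phi must satisfy |phi| < 1 for a stationary process")
    # Start from the stationary distribution so early samples are not biased.
    x = rng.gauss(0.0, sigma_eps / math.sqrt(1.0 - phi * phi))
    samples = []
    for _ in range(n):
        samples.append(x)
        x = phi * x + rng.gauss(0.0, sigma_eps)
    return samples


def synthesize_trace(baseline, phi, sigma_eps, seed=0):
    """Add AR(1) noise to a baseline profile; clip at zero (assumed: load >= 0)."""
    rng = random.Random(seed)
    noise = generate_ar1_noise(len(baseline), phi, sigma_eps, rng)
    return [max(0.0, b + e) for b, e in zip(baseline, noise)]


if __name__ == "__main__":
    # Hypothetical daily profile sampled every 15 minutes (96 points).
    baseline = [10.0 + 5.0 * math.sin(2.0 * math.pi * t / 96.0) for t in range(96)]
    # Several synthetic variants of the same AP, differing only in the noise seed.
    traces = [synthesize_trace(baseline, phi=0.8, sigma_eps=1.0, seed=s) for s in range(3)]
    print([round(v, 2) for v in traces[0][:5]])
```

Under these assumptions, only the baseline and two noise parameters (phi and sigma_eps) need to be kept per AP, which is one plausible reading of the abstract's claim of minimal real data requirements; varying the seed yields many distinct traces from the same statistics.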