Protecting data privacy through anonymization is a critical concern for network operators and data owners before data is forwarded for other uses. With the adoption of Artificial Intelligence (AI), data anonymization increases the likelihood of concealing sensitive information, preventing data leakage and information loss. OpenWiFi networks are vulnerable to any adversary trying to gain access to, or knowledge of, traffic, regardless of the knowledge possessed by data owners. The risk that actual traffic information is discovered is addressed by applying a conditional tabular generative adversarial network (CTGAN). CTGAN yields synthetic data that passes for actual data while keeping the sensitive information of the actual data hidden. In this paper, the similarity of synthetic data to actual data is assessed using clustering algorithms, followed by a comparison of performance on unsupervised cluster validation metrics. The well-known K-means algorithm outperforms the other algorithms in assessing the similarity of synthetic data to real data, achieving the nearest scores of 0.634, 23714.57, and 0.598 for the Silhouette, Calinski-Harabasz, and Davies-Bouldin metrics, respectively. In a comparative analysis of validation scores across several algorithms, K-means stands out among the unsupervised clustering algorithms, supporting the direct use of synthetic data as a replacement for real data. Hence, the experimental results aim to show the viability of using CTGAN-generated synthetic data, in lieu of publishing anonymized data, in various applications.
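As an illustration of the evaluation idea described in the abstract, the following minimal sketch clusters a dataset with K-means and computes the Silhouette, Calinski-Harabasz, and Davies-Bouldin scores, then compares the scores obtained on two datasets. It is not the paper's implementation: the helper names (`kmeans`, `validation_scores`, `blobs`) are hypothetical, the metrics are written from their standard definitions using only the Python standard library, and both the "real" and "synthetic" datasets are Gaussian stand-ins rather than OpenWiFi traffic or CTGAN output.

```python
import math
import random

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def groups(points, labels):
    """Map each cluster label to the list of points carrying it."""
    out = {}
    for p, l in zip(points, labels):
        out.setdefault(l, []).append(p)
    return out

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's algorithm; returns one cluster label per point."""
    centers = random.Random(seed).sample(points, k)
    labels = None
    for _ in range(iters):
        new = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new == labels:  # assignments stopped changing
            break
        labels = new
        for j, members in groups(points, labels).items():
            centers[j] = centroid(members)
    return labels

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b); needs at least two clusters."""
    g, total = groups(points, labels), 0.0
    for p, l in zip(points, labels):
        own = g[l]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in g.items() if m != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def calinski_harabasz(points, labels):
    """Between-cluster over within-cluster dispersion, each scaled by its degrees of freedom."""
    g, overall = groups(points, labels), centroid(points)
    between = within = 0.0
    for members in g.values():
        c = centroid(members)
        between += len(members) * math.dist(c, overall) ** 2
        within += sum(math.dist(p, c) ** 2 for p in members)
    n, k = len(points), len(g)
    return (between / (k - 1)) / (within / (n - k))

def davies_bouldin(points, labels):
    """Mean over clusters of the worst (spread_i + spread_j) / centroid distance."""
    g = groups(points, labels)
    cents = {l: centroid(m) for l, m in g.items()}
    spread = {l: sum(math.dist(p, cents[l]) for p in m) / len(m) for l, m in g.items()}
    return sum(max((spread[i] + spread[j]) / math.dist(cents[i], cents[j])
                   for j in g if j != i) for i in g) / len(g)

def validation_scores(points, k):
    """Cluster with K-means, then report the three validation metrics."""
    labels = kmeans(points, k)
    return {"silhouette": silhouette(points, labels),
            "calinski_harabasz": calinski_harabasz(points, labels),
            "davies_bouldin": davies_bouldin(points, labels)}

def blobs(seed, per_cluster=40):
    """Stand-in data: three 2-D Gaussian blobs (NOT real traffic, NOT CTGAN output)."""
    rng = random.Random(seed)
    centers = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
    return [(rng.gauss(cx, 1.0), rng.gauss(cy, 1.0))
            for cx, cy in centers for _ in range(per_cluster)]

real_scores = validation_scores(blobs(seed=1), k=3)        # stand-in for real data
synthetic_scores = validation_scores(blobs(seed=2), k=3)   # stand-in for synthetic data
gaps = {name: abs(real_scores[name] - synthetic_scores[name]) for name in real_scores}
for name in real_scores:
    print(name, real_scores[name], synthetic_scores[name], gaps[name])
```

Under this reading, a smaller gap between the real and synthetic scores suggests that the synthetic data reproduces the cluster structure of the real data more closely. Higher Silhouette and Calinski-Harabasz values and lower Davies-Bouldin values indicate better-separated clusters. In practice, scikit-learn provides `silhouette_score`, `calinski_harabasz_score`, and `davies_bouldin_score` for the same metrics.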