Preserving private user data is of paramount importance for high Quality of Experience (QoE) and acceptability, particularly for services that handle sensitive data, such as IT-based health services. Anonymization techniques have been shown to be prone to data re-identification, and synthetic data generation has gradually replaced them because it is less time- and resource-consuming and more robust to data leakage. Generative Adversarial Networks (GANs) have been used to generate synthetic datasets, especially GAN frameworks that adhere to differential privacy principles. This research compares state-of-the-art GAN-based models for generating synthetic time-series medical records of dementia patients that can be distributed without privacy concerns. Predictive modeling, autocorrelation, and distribution analysis are used to assess the Quality of Generating (QoG) of the generated data. The privacy preservation of each model is assessed by applying membership inference attacks to determine potential data leakage risks. Our experiments indicate that the privacy-preserving GAN (PPGAN) model outperforms the other models in privacy preservation while maintaining an acceptable level of QoG. The presented results can support better data protection for medical use cases in the future.
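As an illustration of the autocorrelation part of the QoG assessment, the following is a minimal sketch of how the autocorrelation profiles of a real and a synthetic series could be compared. It is not the procedure used in this work: the function names `autocorrelation` and `autocorrelation_gap`, the choice of mean absolute difference as the score, and the toy sine-wave data are all assumptions made for the example.

```python
import math
import random
from statistics import fmean


def autocorrelation(series, max_lag):
    """Sample autocorrelation of a 1-D series for lags 0..max_lag."""
    x = [float(v) for v in series]
    if max_lag >= len(x):
        raise ValueError("max_lag must be smaller than the series length")
    mean = fmean(x)
    centered = [v - mean for v in x]
    denom = sum(c * c for c in centered)
    if denom == 0.0:
        raise ValueError("series is constant; autocorrelation is undefined")
    # zip() stops at the shorter input, pairing x[t] with x[t + lag].
    return [
        sum(a * b for a, b in zip(centered, centered[lag:])) / denom
        for lag in range(max_lag + 1)
    ]


def autocorrelation_gap(real, synthetic, max_lag=10):
    """Mean absolute difference between two autocorrelation profiles.

    Hypothetical score for illustration: lower values mean the synthetic
    series reproduces the temporal dependence of the real one more closely.
    """
    acf_real = autocorrelation(real, max_lag)
    acf_synth = autocorrelation(synthetic, max_lag)
    return fmean(abs(r - s) for r, s in zip(acf_real, acf_synth))


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-ins for one real and two synthetic measurement series.
    real = [math.sin(0.3 * t) + rng.gauss(0, 0.1) for t in range(200)]
    close = [math.sin(0.3 * t + 0.5) + rng.gauss(0, 0.1) for t in range(200)]
    noise = [rng.gauss(0, 1) for _ in range(200)]
    print("gap (structured synthetic):", autocorrelation_gap(real, close))
    print("gap (pure noise):", autocorrelation_gap(real, noise))
```

A score like this captures only temporal structure, so in practice it would be read alongside the predictive-modeling and distribution analyses rather than on its own.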