User data privacy can be protected with many methods, ranging from statistical transformations to generative models, but each has critical drawbacks. For example, creating a transformed data set with traditional techniques is highly time-consuming, recent deep learning-based solutions require significant computational resources and long training phases, and differential privacy-based solutions may undermine data utility. In this paper, we propose $\epsilon$-PrivateSMOTE, a technique designed to safeguard against re-identification and linkage attacks, particularly in cases with a high re-identification risk. Our proposal generates synthetic data via noise-induced interpolation to obfuscate high-risk cases while maximising the utility of the original data. Compared with multiple traditional and state-of-the-art privacy-preservation methods on 17 data sets, $\epsilon$-PrivateSMOTE achieves competitive results in privacy risk and better predictive performance than generative adversarial networks, variational autoencoders, and differential privacy baselines. It also reduces energy consumption and time requirements by factors of at least 11 and 15, respectively.
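To make the idea of noise-induced interpolation concrete, the following is a minimal sketch and not the authors' implementation. It assumes purely numeric features, that the high-risk rows have already been identified, and that the interpolation weight is drawn from a Laplace distribution with scale $1/\epsilon$ in place of the uniform weight used by plain SMOTE. The function name `noisy_interpolate` and its parameters (`high_risk_idx`, `k`, `n_new`, `seed`) are hypothetical.

```python
import numpy as np

def noisy_interpolate(X, high_risk_idx, epsilon=1.0, k=3, n_new=2, seed=0):
    """Replace each high-risk row with n_new synthetic rows (illustrative only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    high_risk_idx = np.asarray(high_risk_idx, dtype=int)
    k = min(k, len(X) - 1)  # a row cannot be its own neighbour
    synthetic = []
    for i in high_risk_idx:
        # Euclidean distance from row i to every row in the data set
        dists = np.linalg.norm(X - X[i], axis=1)
        dists[i] = np.inf  # exclude the row itself
        neighbours = np.argsort(dists)[:k]
        for _ in range(n_new):
            j = rng.choice(neighbours)
            # Assumption: Laplace noise with scale 1/epsilon acts as the
            # interpolation weight, so a smaller epsilon means more noise.
            weight = rng.laplace(loc=0.0, scale=1.0 / epsilon)
            synthetic.append(X[i] + weight * (X[j] - X[i]))
    # Rows that were not flagged as high-risk are kept unchanged
    keep = np.setdiff1d(np.arange(len(X)), high_risk_idx)
    if not synthetic:
        return X[keep]
    return np.vstack([X[keep], np.array(synthetic)])
```

In this sketch the original high-risk rows are dropped from the output and only their synthetic replacements are released; whether to keep, drop, or oversample particular rows is a design choice that the abstract does not specify.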