DP-CDA: An Algorithm for Enhanced Privacy Preservation in Dataset Synthesis Through Randomized Mixing

In recent years, the growth of data across sectors such as healthcare, security, finance, and education has created significant opportunities for analysis and informed decision-making. However, these datasets often contain sensitive personal information, which raises serious privacy concerns. Multiple works have shown that a person's identity remains intertwined with their data even after the data is anonymized: because identity and information are not cleanly separable, the patterns in an individual's records can uniquely identify them. Protecting individual privacy is therefore crucial, yet many existing machine learning and data publishing algorithms struggle with high-dimensional data, where they face a trade-off between computational efficiency and privacy. To address these challenges, we introduce DP-CDA, an effective data publishing algorithm. DP-CDA generates synthetic data by randomly mixing the privacy-sensitive data in a class-specific manner and injecting carefully tuned randomness to ensure formal privacy guarantees. Our comprehensive privacy accounting shows that DP-CDA provides a stronger privacy guarantee than existing methods, allowing for better utility while maintaining a stricter level of privacy. To evaluate the effectiveness of DP-CDA, we examine the accuracy of predictive models trained on the synthetic data, which serves as a measure of dataset utility. Importantly, we identify an optimal order of mixing that balances the privacy-utility trade-off. Our results indicate that synthetic datasets produced by DP-CDA can achieve superior utility compared to those generated by conventional data publishing algorithms, even when subject to the same privacy requirements.
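
To make the class-specific mixing idea concrete, the following is a minimal illustrative sketch, not the DP-CDA algorithm itself. It assumes that "mixing" means averaging `order` samples drawn uniformly at random from the same class and that the injected randomness is additive Gaussian noise. The function name `mix_class_wise` and the parameters `order`, `n_per_class`, `sigma`, and `seed` are hypothetical. The abstract does not specify any preprocessing or how the noise is calibrated, so `sigma` here is a free parameter and the sketch on its own carries no formal privacy guarantee.

```python
import random
from collections import defaultdict


def mix_class_wise(features, labels, order, n_per_class, sigma, seed=None):
    """Build synthetic samples by averaging `order` same-class rows and adding noise.

    Illustrative assumption only: `sigma` is NOT calibrated to any privacy
    budget, so this function provides no formal privacy guarantee by itself.
    """
    rng = random.Random(seed)

    # Group the rows by class label so that mixing never crosses classes.
    by_class = defaultdict(list)
    for row, label in zip(features, labels):
        by_class[label].append(row)

    synth_features, synth_labels = [], []
    for label, rows in by_class.items():
        if order > len(rows):
            raise ValueError(f"class {label!r} has fewer than {order} samples")
        for _ in range(n_per_class):
            # Draw `order` distinct rows of this class uniformly at random.
            chosen = rng.sample(rows, order)
            dim = len(chosen[0])
            # Mix the chosen rows by taking their coordinate-wise mean.
            mixture = [sum(r[j] for r in chosen) / order for j in range(dim)]
            # Perturb each coordinate with zero-mean Gaussian noise.
            noisy = [v + rng.gauss(0.0, sigma) for v in mixture]
            synth_features.append(noisy)
            synth_labels.append(label)
    return synth_features, synth_labels
```

In this sketch, a larger `order` averages over more individuals per synthetic sample, while a larger `sigma` adds more perturbation. Both choices affect how closely the synthetic data tracks the original, which is one way to read the abstract's point about an order of mixing that balances privacy and utility.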
