Synthetic data is often perceived as a silver-bullet solution to data anonymization and privacy-preserving data publishing. Drawn from generative models such as diffusion models, synthetic data is expected to preserve the statistical properties of the original dataset while remaining resilient to privacy attacks. Recent diffusion models have proven effective across a wide range of data types, but their privacy resilience, particularly for tabular formats, remains largely unexplored. The MIDST challenge sought a quantitative evaluation of the privacy gain of synthetic tabular data generated by diffusion models, with a specific focus on its resistance to membership inference attacks (MIAs). Given the heterogeneity and complexity of tabular data, multiple target models were explored for MIAs, including diffusion models for single tables of mixed data types and for multi-relational tables with interconnected constraints. As a key outcome, MIDST inspired the development of novel black-box and white-box MIAs tailored to these target diffusion models, enabling a comprehensive evaluation of their privacy efficacy. The MIDST GitHub repository is available at https://github.com/VectorInstitute/MIDST