Synthetic data generation (SDG) has become increasingly popular as a privacy-enhancing technology. It aims to preserve important statistical properties of the underlying training data while excluding any personally identifiable information. Many SDG algorithms have been developed in recent years to improve and balance these two aims, and many of them provide robust differential privacy guarantees. However, we show here that if the differential privacy parameter $\varepsilon$ is set too high, unambiguous privacy leakage can result. We show this by conducting a novel membership inference attack (MIA) on two state-of-the-art differentially private SDG algorithms: MST and PrivBayes. Our work suggests that these generators have previously unreported vulnerabilities, and that future work to strengthen their privacy is advisable. We present the heuristic for our MIA here. It assumes that the attacker has access to auxiliary "population" data and knows which SDG algorithm was used. We use this information to adapt the recent DOMIAS MIA specifically to MST and PrivBayes. Our approach went on to win the SNAKE challenge in November 2023.
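To make the attack setting concrete, the following is a minimal sketch of a DOMIAS-style membership score, which compares how likely a candidate record is under the synthetic data with how likely it is under the auxiliary population data. It is not the adaptation to MST and PrivBayes described in this work: the density estimate used here (a product of smoothed one-way marginals over integer-coded categorical columns) is an assumption chosen for brevity, and the names `marginal_log_density`, `domias_scores`, `n_levels`, and `alpha` are hypothetical.

```python
import numpy as np


def marginal_log_density(data, points, n_levels, alpha=1.0):
    """Log-density of `points` under a product of smoothed one-way
    marginals fitted on `data` (assumed density model, not the paper's)."""
    log_p = np.zeros(len(points))
    for j, k in enumerate(n_levels):
        # Count each category in column j, with additive smoothing so
        # that unseen categories do not give log(0).
        counts = np.bincount(data[:, j], minlength=k) + alpha
        probs = counts / counts.sum()
        log_p += np.log(probs[points[:, j]])
    return log_p


def domias_scores(synthetic, reference, targets, n_levels):
    """Log density ratio log p_synth(x) - log p_ref(x) for each target.
    Higher scores suggest the target is more likely a training member."""
    return (marginal_log_density(synthetic, targets, n_levels)
            - marginal_log_density(reference, targets, n_levels))


# Toy data: two binary columns, coded as integers 0/1.
synthetic = np.array([[0, 0], [0, 0], [0, 0], [1, 1]])
reference = np.array([[0, 0], [1, 1], [1, 0], [0, 1]])
targets = np.array([[0, 0], [1, 1]])

scores = domias_scores(synthetic, reference, targets, n_levels=[2, 2])
print(scores)
```

In this toy example the record `[0, 0]` is over-represented in the synthetic data relative to the population data, so it receives a positive score, while `[1, 1]` receives a negative one. An attacker would threshold or rank these scores to guess membership.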