Training generative machine learning models to produce synthetic tabular data has become a popular approach for enhancing privacy in data sharing. As this typically involves processing sensitive personal information, releasing either the trained model or generated synthetic datasets can still pose privacy risks. Yet, recent research, commercial deployments, and privacy regulations like the General Data Protection Regulation (GDPR) largely assess anonymity at the level of an individual dataset. In this paper, we rethink anonymity claims about synthetic data from a model-centric perspective and argue that meaningful assessments must account for the capabilities and properties of the underlying generative model and be grounded in state-of-the-art privacy attacks. This perspective better reflects real-world products and deployments, where trained models are often readily accessible for interaction or querying. We interpret the GDPR's definitions of personal data and anonymization under such access assumptions to identify the types of identifiability risks that must be mitigated and map them to privacy attacks across different threat settings. We then argue that synthetic data techniques alone do not ensure sufficient anonymization. Finally, we compare the two mechanisms most commonly used alongside synthetic data -- Differential Privacy (DP) and Similarity-based Privacy Metrics (SBPMs) -- and argue that while DP can offer robust protections against identifiability risks, SBPMs lack adequate safeguards. Overall, our work connects regulatory notions of identifiability with model-centric privacy attacks, enabling more responsible and trustworthy regulatory assessment of synthetic data systems by researchers, practitioners, and policymakers.