Synthetic Data Generation (SDG) can be used to facilitate privacy-preserving data sharing. However, most existing research focuses on privacy attacks in which the adversary is the recipient of the released synthetic data and attempts to infer sensitive information from it. This study instead investigates quality degradation attacks mounted by adversaries who have access to the real dataset or control over the generation process, such as the data owner, the synthetic data provider, or an intruder. We formalize a corresponding threat model and empirically evaluate how targeted manipulations of the real data (e.g., label flipping and feature-importance-based interventions) degrade the quality of the generated synthetic data. The results show that even small perturbations can substantially reduce downstream predictive performance and increase statistical divergence, exposing vulnerabilities within SDG pipelines. This study highlights the need to integrate integrity verification and robustness mechanisms, alongside privacy protection, to ensure the reliability and trustworthiness of synthetic data sharing frameworks.
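As a minimal sketch of one of the manipulations studied here, label flipping can be illustrated as follows. The function name, the binary-label assumption, and the flip fraction are illustrative choices, not details taken from the paper: the adversary flips a small random subset of labels in the real dataset before it is handed to the synthetic data generator.

```python
import numpy as np

def flip_labels(y, fraction=0.05, seed=0):
    """Return a copy of binary labels y with `fraction` of entries flipped.

    Hypothetical sketch of a label-flipping perturbation: an adversary
    with access to the real dataset (e.g., a malicious data owner)
    corrupts a small share of labels prior to synthetic data generation.
    """
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    n_flip = int(len(y) * fraction)            # number of labels to corrupt
    idx = rng.choice(len(y), size=n_flip, replace=False)
    y[idx] = 1 - y[idx]                        # flip 0 -> 1 and 1 -> 0
    return y

# Usage: poison 10% of a toy all-zero label vector
labels = np.zeros(1000, dtype=int)
poisoned = flip_labels(labels, fraction=0.10)
print(int(poisoned.sum()))  # 100 labels flipped from 0 to 1
```

A generator trained on `poisoned` in place of `labels` would then inherit the corrupted label distribution, which is the mechanism behind the downstream performance drop reported in the abstract.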