Statistical properties and privacy guarantees of an original distance-based fully synthetic data generation method

Introduction: The amount of data generated by original research is growing exponentially. Releasing these data publicly is recommended to comply with Open Science principles. However, data collected from human participants cannot be released as-is without raising privacy concerns. Fully synthetic data are a promising answer to this challenge. This approach is being explored by the French Centre de Recherche en Épidémiologie et Santé des Populations (CESP) in the form of a synthetic data generation framework based on Classification and Regression Trees and an original distance-based filtering step. The goal of this work was to develop a refined version of this framework and to assess its risk-utility profile with empirical and formal tools, including novel ones developed for the purpose of this evaluation.

Materials and Methods: Our synthesis framework consists of four successive steps, each designed to prevent specific disclosure risks. We assessed its performance by applying two or more of these steps to a rich epidemiological dataset. Privacy and utility metrics were computed for each of the resulting synthetic datasets, which were further assessed using machine learning approaches.

Results: The computed metrics showed a satisfactory level of protection against attribute disclosure attacks for each synthetic dataset, especially when the full framework was used. Membership disclosure attacks were formally prevented without significantly altering the data. Machine learning approaches showed a low risk of success for simulated singling-out and linkability attacks. Distributional and inferential similarity with the original data was high for all datasets.

Discussion: This work showed the technical feasibility of generating publicly releasable synthetic data using a multi-step framework. The formal and empirical tools developed specifically for this demonstration are a valuable contribution to the field. Further research should focus on extending and validating these tools, in an effort to characterize the intrinsic qualities of alternative data synthesis methods.

Conclusion: By successfully assessing the quality of data produced using a novel multi-step synthetic data generation framework, we showed the technical and conceptual soundness of the Open-CESP initiative, which appears ready for full-scale implementation.
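
As a rough illustration of what a distance-based filtering step can look like, the sketch below drops any synthetic record that lies too close to an original record. It is not the framework's actual procedure: the abstract does not specify the distance metric, the threshold, or how records are encoded, so the Euclidean metric, the `min_distance` parameter, and the function name `filter_by_distance` are all assumptions made for this example.

```python
import math


def filter_by_distance(original, synthetic, min_distance):
    """Keep synthetic records whose nearest original record is far enough away.

    Assumptions (not taken from the source): records are numeric tuples of
    equal length, distance is Euclidean, and min_distance is a hypothetical
    user-chosen threshold.
    """
    kept = []
    for record in synthetic:
        # Distance from this synthetic record to its closest original record.
        nearest = min(math.dist(record, ref) for ref in original)
        if nearest >= min_distance:
            kept.append(record)
    return kept


# Toy data: the first synthetic record nearly copies an original one.
original = [(0.0, 0.0), (1.0, 1.0)]
synthetic = [(0.0, 0.05), (0.5, 0.5), (3.0, 3.0)]
kept = filter_by_distance(original, synthetic, min_distance=0.1)
```

On this toy input, the near-copy `(0.0, 0.05)` is removed and the other two records are kept. In practice, mixed categorical and continuous variables would need a different distance, and the threshold would have to be justified against the privacy metrics mentioned in the abstract.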
