Synthetic data offers a promising solution to privacy concerns in healthcare by generating useful datasets in a privacy-aware manner. However, although synthetic data is typically developed with the intention of being shared, ambiguous reidentification risk assessments often prevent it from seeing the light of day. One of the main causes is that privacy metrics for synthetic data, which inform on reidentification risks, are not well aligned with practical requirements and regulations regarding data sharing in healthcare. This article discusses the paradoxical situation in which synthetic data is designed for data sharing yet often remains restricted, and how the field should move forward to mitigate this issue.