Sharing data can often enable compelling applications and analytics. However, more often than not, valuable datasets contain information of a sensitive nature, and thus sharing them can endanger the privacy of users and organizations. A possible alternative gaining momentum in the research community is to share synthetic data instead. The idea is to release artificially generated datasets that resemble the actual data; more precisely, datasets that have similar statistical properties. So how do you generate synthetic data? What is it useful for? What are the benefits and the risks? What research questions remain open? In this article, we provide a gentle introduction to synthetic data and discuss its use cases, the privacy challenges that are still unaddressed, and its inherent limitations as an effective privacy-enhancing technology.