The Diverse Communities Data Excerpts are the core of a National Institute of Standards and Technology (NIST) program to strengthen understanding of tabular data deidentification technologies such as synthetic data. Synthetic data is an ambitious attempt to democratize the benefits of big data; it uses generative models to recreate sensitive personal data with new records for public release. However, it is vulnerable to the same bias and privacy issues that affect other machine learning applications, and can even amplify those issues. When deidentified data distributions introduce bias or artifacts, or leak sensitive information, they propagate these problems to downstream applications. Furthermore, real-world survey conditions such as diverse subpopulations, heterogeneous non-ordinal data spaces, and complex dependencies between features pose specific challenges for synthetic data algorithms. These observations motivate the need for real, diverse, and complex benchmark data to support a robust understanding of algorithm behavior. This paper makes four contributions: new theoretical work on the relationship between diverse populations and challenges for equitable deidentification; public benchmark data focused on diverse populations and challenging features curated from the American Community Survey; an open source suite of evaluation metrology for deidentified datasets; and an archive of evaluation results on a broad collection of deidentification techniques. The initial set of evaluation results demonstrates the suitability of these tools for investigations in this field.