The Collaborative Research Cycle (CRC) is a National Institute of Standards and Technology (NIST) benchmarking program intended to strengthen understanding of tabular data deidentification technologies. Deidentification algorithms are vulnerable to the same bias and privacy issues that affect other data analytics and machine learning applications, and can even amplify those issues by contaminating downstream applications. This paper summarizes four CRC contributions: theoretical work on the relationship between diverse populations and challenges for equitable deidentification; public benchmark data focused on diverse populations and challenging features; a comprehensive open-source suite of evaluation metrology for deidentified datasets; and an archive of more than 450 deidentified data samples from a broad range of techniques. The initial set of evaluation results demonstrates the value of these tools for investigations in this field.