Microdata and survey datasets often contain private information about individuals, such as their health status, income, or political preferences. Previous studies have shown that, even after anonymization, a malicious intruder may still be able to identify individuals in a dataset by matching their variables to external information. Disclosure risk measures are statistical measures that quantify how large this risk is for a specific dataset. One of the most common is the number of sample uniques that are also population uniques. \cite{Man12} have shown that mixed membership models can provide very accurate estimates of this measure. A limitation of that approach is that the number of extreme profiles has to be chosen by the modeller. In this article, we propose a non-parametric version of the model, based on the Hierarchical Dirichlet Process (HDP). The proposed approach does not require any tuning parameter or model selection step and provides accurate estimates of the disclosure risk measure, even with samples as small as 1$\%$ of the population size. Moreover, we present a data augmentation scheme to address the presence of structural zeros. The proposed methodology is tested on a real dataset from the New York census.