Disclosure risk assessment with Bayesian non-parametric hierarchical modelling

Micro and survey datasets often contain private information about individuals, like their health status, income or political preferences. Previous studies have shown that, even after data anonymization, a malicious intruder could still be able to identify individuals in the dataset by matching their variables to external information. Disclosure risk measures are statistical measures meant to quantify how big such a risk is for a specific dataset. One of the most common measures is the number of sample unique values that are also population-unique. \cite{Man12} have shown how mixed membership models can provide very accurate estimates of this measure. A limitation of that approach is that the number of extreme profiles has to be chosen by the modeller. In this article, we propose a non-parametric version of the model, based on the Hierarchical Dirichlet Process (HDP). The proposed approach does not require any tuning parameter or model selection step and provides accurate estimates of the disclosure risk measure, even with samples as small as 1$\%$ of the population size. Moreover, a data augmentation scheme to address the presence of structural zeros is presented. The proposed methodology is tested on a real dataset from the New York census.

翻译：微观与调查数据集常包含个体的私人信息，如健康状况、收入或政治倾向。先前研究表明，即使经过数据匿名化处理，恶意入侵者仍可能通过将数据变量与外部信息匹配来识别数据集中的个体。披露风险度量是一种旨在量化特定数据集此类风险大小的统计指标。其中最常见的度量之一是样本唯一值同时为总体唯一值的数量。\cite{Man12} 已证明混合隶属度模型如何能够提供该度量的高精度估计。该方法的一个局限在于极端剖面的数量需由建模者预先设定。本文提出一种基于分层狄利克雷过程（HDP）的非参数化模型版本。所提方法无需任何调参或模型选择步骤，即使样本量小至总体规模的1$\%$，仍能提供披露风险度量的精确估计。此外，本文提出了处理结构性零值的数据增强方案。所提方法在纽约人口普查的真实数据集上进行了验证。

相关内容

MoDELS

关注 45

ACM/IEEE第23届模型驱动工程语言和系统国际会议，是模型驱动软件和系统工程的首要会议系列，由ACM-SIGSOFT和IEEE-TCSE支持组织。自1998年以来，模型涵盖了建模的各个方面，从语言和方法到工具和应用程序。模特的参加者来自不同的背景，包括研究人员、学者、工程师和工业专业人士。MODELS 2019是一个论坛，参与者可以围绕建模和模型驱动的软件和系统交流前沿研究成果和创新实践经验。今年的版本将为建模社区提供进一步推进建模基础的机会，并在网络物理系统、嵌入式系统、社会技术系统、云计算、大数据、机器学习、安全、开源等新兴领域提出建模的创新应用以及可持续性。官网链接：http://www.modelsconference.org/

【CVPR 2022】一个完全无监督的框架，从噪声和部分测量中学习图像，Robust Equivariant Imaging: a fully unsupervised framework for learning to image

专知会员服务

25+阅读 · 2022年3月3日

【ACL2020】多模态信息抽取，365页ppt

专知会员服务

151+阅读 · 2020年7月6日

生成性对抗网络:理论模型、评估指标和最近发展的概述，Generative Adversarial Networks (GANs): An Overview of Theoretical Model, Evaluation Metrics, and Recent Developments

专知会员服务

42+阅读 · 2020年5月30日

【AI应用】Facebook-利用神经网络求解高等数学方程, Using neural networks to solve advanced mathematics equations

专知会员服务

34+阅读 · 2020年1月15日