Randomized response is a popular local anonymization approach that can deliver anonymized multi-dimensional data sets with rigorous privacy guarantees. It also preserves validity for exploratory analysis and machine learning tasks because, under fairly general conditions, unbiased estimates of the underlying true distributions can be retrieved. However, as with many other anonymization techniques, one of its main pitfalls is the curse of dimensionality. For data sets with many attributes, the computational cost of estimating the true distributions quickly becomes unsustainable, and the accuracy of those estimates degrades. Relying on new theoretical insights developed in this paper, we propose an approach to multi-dimensional randomized response that avoids these traditional limitations. Building on simple, intuitive parameterizations of the randomization matrices, we develop a protocol called Lambda-randomization that retrieves estimates of multivariate distributions at low computational cost and that uses only three simple elements: a set of parameters ranging between 0 and 1 (one per attribute of the data set), the identity matrix, and the all-ones vector. We also present an empirical application to illustrate the proposed protocol.
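To make the three ingredients concrete, here is a minimal Python sketch of one way such a protocol could look. The abstract does not state the exact form of the randomization matrices, so the form used here is an assumption: for an attribute with r categories and parameter λ, the matrix is taken to be P = λI + ((1 − λ)/r)·11ᵀ, that is, keep the true value with probability λ and otherwise draw a category uniformly at random. The function names (`lambda_matrix`, `randomize`, `invert_joint`, `estimate_joint`) are hypothetical and not taken from the paper.

```python
import numpy as np


def lambda_matrix(lam, r):
    """Assumed randomization matrix: lam * I + (1 - lam) / r * 1 1^T.

    Row i gives the distribution of the reported category when the true
    category is i. This form is an illustrative assumption.
    """
    return lam * np.eye(r) + (1.0 - lam) / r * np.ones((r, r))


def randomize(column, lam, r, rng):
    """Keep each value with probability lam, else report a uniform draw."""
    keep = rng.random(column.shape[0]) < lam
    noise = rng.integers(0, r, size=column.shape[0])
    return np.where(keep, column, noise)


def invert_joint(observed, lams):
    """Undo the assumed randomization on a joint distribution tensor.

    Under the assumed form, inverting attribute j needs only lam_j and
    the sum along axis j, so no full Kronecker-product matrix over all
    category combinations is ever built or inverted.
    """
    est = np.asarray(observed, dtype=float)
    for axis, lam in enumerate(lams):
        r = est.shape[axis]
        marginal = est.sum(axis=axis, keepdims=True)
        est = (est - (1.0 - lam) / r * marginal) / lam
    return est


def estimate_joint(randomized, lams, sizes):
    """Estimate the true joint distribution from randomized integer codes.

    randomized: (n, d) integer array, one column per attribute.
    sizes: number of categories per attribute.
    """
    counts = np.zeros(sizes)
    np.add.at(counts, tuple(randomized.T), 1)  # contingency table
    return invert_joint(counts / randomized.shape[0], lams)
```

Under this assumption the estimate is unbiased but can contain small negative cells in finite samples; how the paper handles that case is not stated in the abstract.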