Traditional perturbative statistical disclosure control (SDC) approaches, such as microaggregation, noise addition, and rank swapping, perturb the data in an "ad hoc" way: they preserve some particular aspects of the data but end up modifying others. Synthetic data approaches based on the fully conditional specification data synthesis paradigm, on the other hand, aim to generate new datasets that follow the same joint probability distribution as the original data. These synthetic data approaches, however, rely on either parametric statistical models or non-parametric machine learning models, which need to fit the original data well in order to generate credible and useful synthetic data. Another important drawback is that they tend to perform better when the variables are synthesized in the correct causal order (i.e., in the same order as the true data-generating process), which is often unknown in practice. To circumvent these issues, we propose a fully non-parametric and model-free perturbative SDC approach that approximates the joint distribution of the original data via sequential applications of restricted permutations to the numerical microdata, where the restricted permutations are guided by the joint distribution of a discretized version of the data. Empirical comparisons against popular SDC approaches, using both real and simulated datasets, suggest that the proposed approach is competitive in terms of the trade-off between confidentiality and data utility.
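To make the idea of a restricted permutation concrete, the following is a minimal sketch of one possible reading of the abstract, not the authors' algorithm. It assumes quantile binning as the discretization and restricts each permutation to records that fall in the same joint cell of the discretized data. The function names `discretize` and `restricted_shuffle` and the parameters `n_bins` and `seed` are hypothetical.

```python
import numpy as np


def discretize(X, n_bins):
    """Map each column to integer bin labels using empirical quantile cut points."""
    labels = np.empty(X.shape, dtype=int)
    for j in range(X.shape[1]):
        # Interior quantiles only, so labels fall in 0..n_bins-1.
        cuts = np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1)[1:-1])
        labels[:, j] = np.searchsorted(cuts, X[:, j], side="right")
    return labels


def restricted_shuffle(X, n_bins=4, seed=0):
    """Shuffle each column in turn, but only among rows in the same joint cell.

    Assumption: a "cell" is the tuple of bin labels across all columns.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    labels = discretize(X, n_bins)
    # Rows sharing the same tuple of bin labels receive the same cell id.
    _, cell = np.unique(labels, axis=0, return_inverse=True)
    cell = cell.ravel()  # the inverse's shape differs across NumPy versions
    out = X.copy()
    for j in range(X.shape[1]):          # sequential pass over the variables
        for c in np.unique(cell):
            idx = np.flatnonzero(cell == c)
            # Independent permutation per column breaks within-cell record linkage.
            out[idx, j] = X[rng.permutation(idx), j]
    return out


rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
Y = restricted_shuffle(X, n_bins=3, seed=2)
```

Under these assumptions, each column of `Y` is a permutation of the corresponding column of `X`, so the marginal distributions are unchanged, and every record stays in its original discretized cell, so the joint distribution of the discretized data is preserved exactly. The number of bins controls the trade-off in this sketch: more bins give smaller cells, which means less perturbation and closer agreement with the original joint distribution.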