With data now routinely collected from multiple sources and shared with third parties, protecting individual privacy is a critical concern. Although many anonymization methods exist, how well they preserve utility and how much privacy they guarantee remain difficult to quantify. In this work, we address this gap by studying the utility and privacy of the spectral anonymization (SA) algorithm, particularly in an asymptotic framework. Unlike conventional anonymization methods, which modify the original data directly, SA perturbs the data in a spectral basis and then maps them back to the original basis. Alongside the original version, $\mathcal{P}$-SA, which employs a random permutation transformation, we introduce two novel variants, $\mathcal{J}$-SA and $\mathcal{O}$-SA, which employ sign-change and orthogonal matrix transformations, respectively. We show how well, under some practical assumptions, these SA algorithms preserve the first and second moments of the original data. In particular, our results reveal that the asymptotic efficiency of all three SA algorithms in covariance estimation is exactly 50% relative to the original data. To assess how well these asymptotic results apply in practice, we conduct a simulation study with finite samples and evaluate the privacy protection offered by these algorithms using distance-based record linkage. We find that, while no method is clearly superior in finite-sample utility, $\mathcal{O}$-SA stands out for its privacy preservation, never producing records identical to the originals, albeit at increased computational cost. Conversely, $\mathcal{P}$-SA is a computationally efficient alternative with unmatched efficiency in mean estimation.
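As an illustration of the perturb-in-a-spectral-basis idea described above, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the spectral basis is obtained from a singular value decomposition of the centered data and that each left singular vector is transformed independently (by a random permutation, random sign changes, or a random orthogonal matrix); the exact constructions used for $\mathcal{J}$-SA and $\mathcal{O}$-SA may differ. The function name `spectral_anonymize` and its parameters are hypothetical.

```python
import numpy as np


def spectral_anonymize(X, variant="P", seed=None):
    """Sketch of spectral anonymization (hypothetical helper, see assumptions above).

    X       : (n, p) data matrix, rows are records.
    variant : "P" (permutation), "J" (sign change) or "O" (orthogonal matrix).
    seed    : anything accepted by numpy.random.default_rng.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, _ = X.shape

    # Move to a spectral basis: center the data and take a thin SVD.
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)

    U_new = np.empty_like(U)
    for j in range(U.shape[1]):
        u = U[:, j]
        if variant == "P":
            # Randomly permute the entries of the j-th left singular vector.
            U_new[:, j] = rng.permutation(u)
        elif variant == "J":
            # Flip the sign of each entry independently with probability 1/2.
            U_new[:, j] = rng.choice([-1.0, 1.0], size=n) * u
        elif variant == "O":
            # Draw a random orthogonal matrix via QR of a Gaussian matrix;
            # the sign correction makes the draw uniform over orthogonal matrices.
            Q, R = np.linalg.qr(rng.standard_normal((n, n)))
            Q = Q * np.sign(np.diag(R))
            U_new[:, j] = Q @ u
        else:
            raise ValueError("variant must be 'P', 'J' or 'O'")

    # Map the perturbed scores back to the original basis and restore the mean.
    return (U_new * s) @ Vt + mean
```

In this sketch the permutation variant leaves the column sums of the centered scores at zero, so the sample mean of the anonymized data equals that of the original data, whereas the orthogonal variant draws a fresh $n \times n$ matrix per singular vector, which illustrates why it is the most expensive of the three.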