Protecting confidential data while preserving utility is particularly challenging when data sets contain outlying observations. Existing latent space anonymization methods, such as spectral anonymization (SA), rely on principal component analysis (PCA) and may therefore be vulnerable to contamination. We investigate anonymization in the presence of outliers and propose ICSA, a robust alternative to SA based on invariant coordinate selection (ICS). By replacing the PCA transformation with ICS, the robustness of the anonymization procedure can be regulated through the choice of scatter matrices. Alongside the methodological development, we derive a theoretical result showing that SA fails under sufficiently influential outliers. To assess the practical implications of this result, we compare the privacy-utility trade-off of ICSA and SA through simulation experiments under varying contamination settings and outlier severities. Our findings indicate that implementations of ICSA based on robust scatter matrices achieve stronger privacy protection than SA, while typically maintaining comparable, and in some cases improved, utility. We further examine the empirical performance of the proposed method using a benchmark clinical data set, where ICSA demonstrates superior overall privacy-utility efficiency relative to SA. These results suggest that explicitly accounting for outliers can materially improve anonymization performance and that robust latent space transformations offer a promising direction for privacy-preserving statistical data release.
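To make the central idea concrete, the following is a minimal sketch of a latent space anonymization step built on ICS rather than PCA. It is not the authors' implementation. Several things here are assumptions made for illustration: the function names (`cov`, `cov4`, `ics_transform`, `anonymize`), the use of the classical COV-COV4 scatter pair (which is not robust; a robust choice would be swapped in through `scatter1` and `scatter2`), the use of the sample mean as the location estimate, and the anonymization step itself, which is taken here to be an independent permutation of each latent coordinate before back-transforming.

```python
import numpy as np

def cov(X):
    """Sample covariance matrix (first scatter in this illustration)."""
    return np.cov(X, rowvar=False)

def cov4(X):
    """Fourth-moment scatter matrix (COV4), a classical second scatter for ICS."""
    n, p = X.shape
    Xc = X - X.mean(axis=0)
    S_inv = np.linalg.inv(np.cov(Xc, rowvar=False))
    # Squared Mahalanobis distance of each observation from the mean.
    d2 = np.einsum("ij,jk,ik->i", Xc, S_inv, Xc)
    return (Xc * d2[:, None]).T @ Xc / (n * (p + 2))

def ics_transform(X, scatter1, scatter2):
    """Return (Z, B, center) such that B S1 B^T = I and B S2 B^T is diagonal."""
    center = X.mean(axis=0)  # assumption: sample mean as location estimate
    S1, S2 = scatter1(X), scatter2(X)
    # Symmetric inverse square root of S1, used to whiten the data.
    w, V = np.linalg.eigh(S1)
    S1_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T
    # Eigendecomposition of S2 expressed in the whitened coordinates.
    M = S1_inv_sqrt @ S2 @ S1_inv_sqrt
    M = (M + M.T) / 2  # enforce symmetry against rounding error
    evals, U = np.linalg.eigh(M)
    U = U[:, np.argsort(evals)[::-1]]  # order components by decreasing eigenvalue
    B = U.T @ S1_inv_sqrt
    Z = (X - center) @ B.T
    return Z, B, center

def anonymize(X, scatter1, scatter2, rng):
    """Permute each invariant coordinate independently, then map back."""
    Z, B, center = ics_transform(X, scatter1, scatter2)
    Z_perm = np.column_stack(
        [rng.permutation(Z[:, j]) for j in range(Z.shape[1])]
    )
    # Invert Z = (X - center) B^T to return to the original variables.
    return Z_perm @ np.linalg.inv(B).T + center

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 3))
    X[:5] += 10.0  # a few outlying observations
    X_anon = anonymize(X, cov, cov4, rng)
    print(X_anon.shape)
```

Because the scatter matrices are passed in as functions, the robustness of the transformation in this sketch is controlled entirely by which pair is supplied, which mirrors the abstract's point that robustness is regulated through the choice of scatter matrices.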