The exponential growth in the volume of data that is collected, processed, and shared has raised concerns about individuals' privacy. In response, laws and regulations have been established to govern how organizations handle and safeguard data. One approach to meeting these requirements is Statistical Disclosure Control, which aims to minimize the risk of exposing confidential information by de-identifying the data. De-identification is achieved by applying privacy-preserving techniques. There is a trade-off, however: de-identification often causes a loss of information, which can reduce the accuracy of data analysis and the predictive performance of models. The overarching goal is to protect individual privacy while preserving the data's interpretability, that is, its overall usefulness. Despite advances in Statistical Disclosure Control, the field continues to evolve, and no definitive solution yet strikes an optimal balance between privacy and utility. This survey examines the de-identification process in detail. We outline the privacy-preserving techniques currently used to de-identify microdata, examine privacy measures suited to different disclosure scenarios, and assess measures of information loss and predictive performance. We address the main challenges raised by privacy constraints, give an overview of the predominant strategies for mitigating them, categorize privacy-preserving techniques, offer a theoretical assessment of current comparative studies, and highlight a number of unresolved issues in the field.
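To make the privacy-utility trade-off concrete, the following is a minimal sketch rather than a method taken from the survey. It assumes a toy microdata set in which age and ZIP code act as quasi-identifiers, uses generalization as the de-identification technique, and uses k-anonymity (the size of the smallest group of records sharing the same quasi-identifier values) as one illustrative privacy measure. The data and the helper names `generalize`, `k_anonymity`, and `distinct_ratio` are hypothetical, and the share of distinct value combinations retained is only a crude stand-in for an information-loss metric.

```python
from collections import Counter

# Hypothetical toy microdata: (age, zip_code) are treated as quasi-identifiers.
records = [
    (23, "47677"), (27, "47602"), (29, "47678"),
    (41, "47905"), (45, "47909"), (48, "47906"),
]


def generalize(record, age_width, zip_digits):
    """Coarsen one record: bin the age and keep only a ZIP code prefix."""
    age, zip_code = record
    low = (age // age_width) * age_width
    age_range = f"{low}-{low + age_width - 1}"
    masked_zip = zip_code[:zip_digits].ljust(len(zip_code), "*")
    return (age_range, masked_zip)


def k_anonymity(rows):
    """Size of the smallest group of rows with identical quasi-identifiers."""
    return min(Counter(rows).values())


def distinct_ratio(original, released):
    """Crude utility proxy: share of distinct value combinations retained."""
    return len(set(released)) / len(set(original))


raw_k = k_anonymity(records)
released = [generalize(r, age_width=10, zip_digits=3) for r in records]
released_k = k_anonymity(released)
retained = distinct_ratio(records, released)

print(f"k before: {raw_k}, k after: {released_k}, retained: {retained:.2f}")
```

In this toy case every raw record is unique, so generalization raises k while collapsing the six distinct value combinations into two: privacy improves at the cost of detail, which is the tension the abstract describes.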