Anonymization techniques that obfuscate quasi-identifiers by means of value generalization hierarchies are widely used to achieve preset levels of privacy. Preventing different types of attacks against database privacy requires applying several anonymization techniques beyond classical k-anonymity or $\ell$-diversity. However, applying these methods comes at the cost of reduced utility of the anonymized data in prediction and decision-making tasks. In this work we study four classical machine learning methods currently used for classification, and analyze their results as a function of the anonymization techniques applied and the parameters selected for each of them. We study the performance of these models on the well-known Adult dataset when varying the value of k for k-anonymity and when additionally applying $\ell$-diversity, t-closeness and $\delta$-disclosure privacy.
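As a rough illustration of the privacy notions named above, the following minimal sketch generalizes two quasi-identifiers by one level and then measures the resulting k (smallest equivalence class) and the distinct $\ell$ (fewest distinct sensitive values in any class). The records, the one-level hierarchy (age to decade band, zip code to two-digit prefix) and all function names are hypothetical and chosen for illustration only; they are not taken from the paper or from the Adult dataset.

```python
from collections import defaultdict

# Hypothetical toy records: (age, zip_code, sensitive_value).
RECORDS = [
    (23, "39005", "flu"),
    (27, "39011", "asthma"),
    (29, "39012", "flu"),
    (41, "28001", "diabetes"),
    (45, "28003", "flu"),
    (48, "28009", "asthma"),
]


def generalize(record):
    """Apply one level of an assumed generalization hierarchy.

    Age is mapped to its decade band and the zip code is truncated to
    its first two digits. Returns (quasi_identifier_tuple, sensitive).
    """
    age, zip_code, sensitive = record
    low = (age // 10) * 10
    return (f"{low}-{low + 9}", zip_code[:2] + "***"), sensitive


def equivalence_classes(records):
    """Group sensitive values by their generalized quasi-identifiers."""
    classes = defaultdict(list)
    for record in records:
        quasi_identifiers, sensitive = generalize(record)
        classes[quasi_identifiers].append(sensitive)
    return classes


def k_anonymity(classes):
    """The table is k-anonymous for k = size of the smallest class."""
    return min(len(values) for values in classes.values())


def distinct_l_diversity(classes):
    """Distinct l-diversity: fewest distinct sensitive values in a class."""
    return min(len(set(values)) for values in classes.values())


if __name__ == "__main__":
    classes = equivalence_classes(RECORDS)
    print("k =", k_anonymity(classes))
    print("l =", distinct_l_diversity(classes))
```

Generalizing further (for example, wider age bands) would typically raise k at the price of coarser features for a downstream classifier, which is the privacy and utility trade-off the abstract refers to.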