The growing use of machine learning (ML) in sensitive domains requires protecting the training data through privacy frameworks such as differential privacy (DP). DP requires specifying a uniform privacy level $\varepsilon$ that expresses the maximum privacy loss that each data point in the dataset is willing to tolerate. Yet, in practice, different data points often have different privacy requirements. Setting one uniform privacy level is usually too restrictive: it forces the learner to guarantee the most stringent privacy requirement for every data point, often at a large cost to accuracy. To overcome this limitation, we introduce the Personalized-DP Output Perturbation method (PDP-OP), which enables training Ridge regression models with individual, per-data-point privacy levels. We provide rigorous privacy proofs for PDP-OP as well as accuracy guarantees for the resulting model. This work is the first to provide such theoretical accuracy guarantees for personalized DP in machine learning, whereas previous work provided only empirical evaluations. We empirically evaluate PDP-OP on synthetic and real datasets with diverse privacy distributions. We show that allowing each data point to specify its own privacy requirement significantly improves the privacy-accuracy trade-off in DP. We also show that PDP-OP outperforms the personalized privacy techniques of Jorgensen et al. (2015).
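For concreteness, the contrast between a uniform and a per-data-point privacy level can be written as follows. This is a sketch in the style of the personalized DP definition commonly attributed to Jorgensen et al. (2015); the symbols $\mathcal{M}$, $D$, $D'$, $S$, $n$, and $\varepsilon_i$ are notation assumed here for illustration and may differ from the formal definitions used in the paper itself.

$$
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in S]
\qquad \text{(uniform DP: for all neighboring } D, D' \text{ and all output sets } S\text{)}
$$

$$
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon_i}\,\Pr[\mathcal{M}(D') \in S]
\qquad \text{(personalized DP: for all } D, D' \text{ differing only in data point } i\text{, and all } S\text{)}
$$

Under these assumptions, a uniform guarantee that respects every requirement must use $\varepsilon = \min_{i \in \{1,\dots,n\}} \varepsilon_i$, which is the "most stringent requirement" situation described above. The personalized formulation lets the guarantee for data point $i$ depend only on its own $\varepsilon_i$.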