Differential privacy (DP) ensures that training a machine learning model does not leak private data. In practice, we may have access to auxiliary public data that is free of privacy concerns. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? 2. How can we harness public data to improve DP model training in practice? We consider these questions in both the local and central models of pure and approximate DP. To answer the first question, we prove tight (up to log factors) lower and upper bounds that characterize the optimal error rates of three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex optimization. We show that the optimal error rates can be attained (up to log factors) by either discarding private data and training a public model, or by treating public data as if it were private and using an optimal DP algorithm. To address the second question, we develop novel algorithms that are "even more optimal" (i.e., with better constants) than the asymptotically optimal approaches described above. For local DP mean estimation, our algorithm is optimal up to and including constants. Empirically, our algorithms show benefits over the state-of-the-art.