Differential privacy (DP) ensures that training a machine learning model does not leak private data. In practice, we may have access to auxiliary public data that is free of privacy concerns. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? 2. How can we harness public data to improve DP model training in practice? We consider these questions in both the local and central models of pure and approximate DP. To answer the first question, we prove lower and upper bounds, tight up to log factors, that characterize the optimal error rates of three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex optimization. We show that the optimal error rates can be attained (up to log factors) by either discarding the private data and training a public model, or treating the public data as if it were private and using an optimal DP algorithm. To address the second question, we develop novel algorithms that are "even more optimal" (i.e., have better constants) than the asymptotically optimal approaches described above. For local DP mean estimation, our algorithm is optimal including constants. Empirically, our algorithms show benefits over the state-of-the-art.
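To make the two baseline strategies named in the abstract concrete, the following is a minimal sketch for central-model, pure ε-DP mean estimation. It is not the authors' algorithm: the Laplace mechanism is swapped in as a standard ε-DP primitive and is not claimed to achieve the optimal rates. The function names and the assumption that every coordinate lies in [-1, 1] are illustrative choices, not taken from the paper.

```python
import numpy as np


def public_only_mean(x_pub):
    """Strategy A: discard the private data and average the public samples.

    No noise is needed because public data carries no privacy constraint.
    """
    return np.clip(x_pub, -1.0, 1.0).mean(axis=0)


def dp_mean_treat_all_private(x_priv, x_pub, eps, rng):
    """Strategy B: pool public and private samples and treat all as private.

    Assumption (illustrative): each coordinate is clipped to [-1, 1], so
    replacing one of the n rows changes the mean by at most 2*d/n in L1 norm.
    Adding Laplace noise with scale (L1 sensitivity / eps) per coordinate
    gives an eps-DP release of the pooled mean.
    """
    x = np.clip(np.vstack([x_priv, x_pub]), -1.0, 1.0)
    n, d = x.shape
    l1_sensitivity = 2.0 * d / n
    noise = rng.laplace(loc=0.0, scale=l1_sensitivity / eps, size=d)
    return x.mean(axis=0) + noise


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_mean = np.array([0.3, -0.2, 0.1])
    # Hypothetical sizes: many private samples, few public ones.
    x_priv = np.clip(true_mean + 0.2 * rng.standard_normal((5000, 3)), -1, 1)
    x_pub = np.clip(true_mean + 0.2 * rng.standard_normal((100, 3)), -1, 1)

    est_a = public_only_mean(x_pub)
    est_b = dp_mean_treat_all_private(x_priv, x_pub, eps=1.0, rng=rng)
    print("public-only squared error:", np.sum((est_a - true_mean) ** 2))
    print("all-private squared error:", np.sum((est_b - true_mean) ** 2))
```

Which strategy has the smaller error depends on the number of public samples, the number of private samples, the dimension, and ε. The abstract's first result states that the better of the two is optimal up to log factors.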