Optimal Differentially Private Learning with Public Data

Differential privacy (DP) ensures that training a machine learning model does not leak private data. However, DP comes at a cost: lower model accuracy or higher sample complexity. In practice, we may have access to auxiliary public data that is free of privacy concerns. This has motivated the recent study of what role public data might play in improving the accuracy of DP models. In this work, we assume access to a given amount of public data and settle the following fundamental open questions: 1. What is the optimal (worst-case) error of a DP model trained over a private data set while having access to side public data? What algorithms are optimal? 2. How can we harness public data to improve DP model training in practice? We consider these questions in both the local and central models of DP. To answer the first question, we prove lower and upper bounds, tight up to constant factors, that characterize the optimal error rates of three fundamental problems: mean estimation, empirical risk minimization, and stochastic convex optimization. We prove that public data reduces the sample complexity of DP model training. Perhaps surprisingly, we show that the optimal error rates can be attained (up to constants) by either discarding the private data and training a model on the public data alone, or treating the public data as if it were private and running an optimal DP algorithm on all of the data. To address the second question, we develop novel algorithms that are "even more optimal" (i.e., have better constants) than the asymptotically optimal approaches described above. For local DP mean estimation with public data, our algorithm is optimal including constants. Empirically, our algorithms show benefits over existing approaches for DP model training with side access to public data.
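The two asymptotically optimal baselines named in the abstract can be illustrated for central-model mean estimation. The following is a minimal sketch under stated assumptions, not the paper's algorithm: the function names, the L2 clipping step, and the use of the classical Gaussian mechanism (noise scale Δ·sqrt(2 ln(1.25/δ))/ε, valid for 0 < ε < 1, with replace-one neighboring data sets) are all illustrative choices that do not come from the source.

```python
import numpy as np


def clip_to_ball(x, radius):
    """Project each row of x onto the L2 ball of the given radius."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    scale = np.minimum(1.0, radius / np.maximum(norms, 1e-12))
    return x * scale


def gaussian_sigma(radius, n, epsilon, delta):
    """Noise scale of the classical Gaussian mechanism for a clipped mean.

    Assumption: replace-one neighbors, so the L2 sensitivity of the mean
    of n points in a ball of the given radius is 2 * radius / n.
    The formula is only valid for 0 < epsilon < 1.
    """
    sensitivity = 2.0 * radius / n
    return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon


def public_only_mean(x_pub):
    """Baseline 1: discard the private data, average the public data."""
    return x_pub.mean(axis=0)


def dp_mean_all_as_private(x_priv, x_pub, radius, epsilon, delta, rng):
    """Baseline 2: treat public data as private and release a DP mean."""
    x = clip_to_ball(np.vstack([x_priv, x_pub]), radius)
    sigma = gaussian_sigma(radius, x.shape[0], epsilon, delta)
    return x.mean(axis=0) + rng.normal(0.0, sigma, size=x.shape[1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    true_mean = np.array([0.3, -0.2, 0.1])
    x_priv = true_mean + 0.1 * rng.normal(size=(900, 3))  # synthetic data
    x_pub = true_mean + 0.1 * rng.normal(size=(100, 3))   # synthetic data

    est_pub = public_only_mean(x_pub)
    est_dp = dp_mean_all_as_private(x_priv, x_pub, 1.0, 0.5, 1e-5, rng)
    print("public-only error:", np.linalg.norm(est_pub - true_mean))
    print("all-as-private DP error:", np.linalg.norm(est_dp - true_mean))
```

Which baseline has lower error depends on the number of public and private samples, the dimension, and the privacy parameters; the abstract's claim is that the better of the two matches the optimal rate up to constants, while the paper's own algorithms aim to improve on those constants.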
