We study the limits and capabilities of public-data-assisted differentially private (PA-DP) algorithms. Specifically, we focus on the problem of stochastic convex optimization (SCO) with either labeled or unlabeled public data. For complete/labeled public data, we show that any $(\epsilon,\delta)$-PA-DP algorithm has excess risk $\tilde{\Omega}\big(\min\big\{\frac{1}{\sqrt{n_{\text{pub}}}},\frac{1}{\sqrt{n}}+\frac{\sqrt{d}}{n\epsilon} \big\} \big)$, where $d$ is the dimension, ${n_{\text{pub}}}$ is the number of public samples, ${n_{\text{priv}}}$ is the number of private samples, and $n={n_{\text{pub}}}+{n_{\text{priv}}}$. These lower bounds are established via our new lower bounds for PA-DP mean estimation, which are of a similar form. Up to constant factors, these lower bounds show that the simple strategy of either treating all data as private or discarding the private data is optimal. We also study PA-DP supervised learning with \textit{unlabeled} public samples. In contrast to our previous result, here we show novel methods for leveraging public data in private supervised learning. For generalized linear models (GLMs) with unlabeled public data, we give an efficient algorithm which, given $\tilde{O}({n_{\text{priv}}}\epsilon)$ unlabeled public samples, achieves the dimension-independent rate $\tilde{O}\big(\frac{1}{\sqrt{{n_{\text{priv}}}}} + \frac{1}{\sqrt{{n_{\text{priv}}}\epsilon}}\big)$. We develop new lower bounds for this setting which show that this rate cannot be improved with more public samples, and that any fewer public samples leads to a strictly worse rate. Finally, we extend these results to general hypothesis classes with finite fat-shattering dimension, with applications to neural networks and non-Euclidean geometries.
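For intuition, the two branches of this lower bound correspond to the two trivial baselines: discarding the private data and running non-private SCO on the $n_{\text{pub}}$ public samples (rate $1/\sqrt{n_{\text{pub}}}$), or treating all $n$ samples as private and running optimal DP-SCO (rate $1/\sqrt{n}+\sqrt{d}/(n\epsilon)$). The following worked comparison uses hypothetical parameter values of our own choosing, purely for illustration: take $d=10^{6}$, $\epsilon=1$, and $n_{\text{priv}}=10^{4}$. Then
\[
  n_{\text{pub}} = 10:\quad
  \frac{1}{\sqrt{n_{\text{pub}}}} \approx 0.32
  \;>\;
  \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon} \approx 0.11
  \quad\text{(treating all data as private wins),}
\]
\[
  n_{\text{pub}} = 10^{4}:\quad
  \frac{1}{\sqrt{n_{\text{pub}}}} = 0.01
  \;<\;
  \frac{1}{\sqrt{n}} + \frac{\sqrt{d}}{n\epsilon} \approx 0.057
  \quad\text{(discarding the private data wins).}
\]
The $\min$ in the lower bound says that, up to the stated factors, no PA-DP algorithm can beat the better of these two baselines.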