Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differentially private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still usually struggle in the high-privacy ($\varepsilon \le 1$) and low-data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of \textit{pure DP}. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalance regimes demonstrates DPPL's high performance under strong privacy guarantees in challenging private learning setups.
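To make the prototype idea concrete, the following is a minimal sketch, not the authors' algorithm: each class prototype is the mean of clipped encoder embeddings, released with Laplace noise calibrated to the mean's L1 sensitivity so the release satisfies pure $\varepsilon$-DP, and inference assigns a query to the nearest prototype. The clipping bound, noise mechanism, and distance metric here are illustrative assumptions.

```python
import numpy as np

def dp_prototypes(embeddings, labels, epsilon, clip=1.0, rng=None):
    """Release one noisy mean embedding per class under pure eps-DP.

    Illustrative sketch: clip each embedding to L1 norm <= `clip`,
    average per class, and add Laplace noise scaled to the mean's
    L1 sensitivity (replacing one record shifts the mean by at most
    2 * clip / n in L1 norm).
    """
    rng = np.random.default_rng(rng)
    protos = {}
    for c in np.unique(labels):
        X = embeddings[labels == c]
        # Clip each embedding so its L1 norm is at most `clip`.
        scale_down = np.maximum(np.abs(X).sum(axis=1, keepdims=True) / clip, 1.0)
        Xc = X / scale_down
        mean = Xc.mean(axis=0)
        # Laplace mechanism: per-coordinate noise with scale sensitivity/eps.
        noise_scale = 2.0 * clip / (len(Xc) * epsilon)
        protos[c] = mean + rng.laplace(0.0, noise_scale, size=mean.shape)
    return protos

def predict(protos, query):
    """Nearest-prototype classification of a query embedding."""
    classes = list(protos)
    dists = [np.linalg.norm(query - protos[c]) for c in classes]
    return classes[int(np.argmin(dists))]
```

Note that only the noisy prototypes are released; the private embeddings never leave the data holder, and no iterative (per-step) noise addition is needed, which is what enables the pure-DP guarantee from a single noisy release.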