Machine learning (ML) models have been shown to leak private information from their training datasets. Differential Privacy (DP), typically implemented through the differentially private stochastic gradient descent algorithm (DP-SGD), has become the standard solution to bound leakage from the models. Despite recent improvements, DP-SGD-based approaches for private learning still struggle in the high-privacy ($\varepsilon \le 1$) and low-data regimes, and when the private training datasets are imbalanced. To overcome these limitations, we propose Differentially Private Prototype Learning (DPPL) as a new paradigm for private transfer learning. DPPL leverages publicly pre-trained encoders to extract features from private data and generates DP prototypes that represent each private class in the embedding space and can be publicly released for inference. Since our DP prototypes can be obtained from only a few private training data points and without iterative noise addition, they offer high-utility predictions and strong privacy guarantees even under the notion of \textit{pure DP}. We additionally show that privacy-utility trade-offs can be further improved when leveraging the public data beyond pre-training of the encoder: in particular, we can privately sample our DP prototypes from the publicly available data points used to train the encoder. Our experimental evaluation with four state-of-the-art encoders, four vision datasets, and under different data and imbalance regimes demonstrates DPPL's high performance under strong privacy guarantees in challenging private learning setups.
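To make the core idea concrete, here is a minimal sketch of one way DP prototypes could be computed: average the encoder embeddings of each private class and release the mean with Laplace noise calibrated to its sensitivity, then classify by nearest prototype. This is an illustrative simplification, not the authors' exact DPPL algorithm; the function names, the L1-clipping choice, and the replace-one neighboring convention are assumptions made for the sketch.

```python
import numpy as np

def dp_prototype(embeddings, eps, rng):
    """Release a pure eps-DP class prototype from private embeddings.

    Assumes each row is one private example's embedding; clipping each
    embedding to unit L1 norm bounds the sensitivity of the mean.
    """
    norms = np.maximum(np.abs(embeddings).sum(axis=1, keepdims=True), 1.0)
    clipped = embeddings / norms
    n = len(clipped)
    mean = clipped.mean(axis=0)
    # Replacing one clipped embedding shifts the mean by at most 2/n in L1,
    # so Laplace noise with scale 2/(n*eps) yields pure eps-DP per prototype.
    noise = rng.laplace(scale=2.0 / (n * eps), size=mean.shape)
    return mean + noise

def predict(x, prototypes):
    """Label a test embedding by its most cosine-similar DP prototype."""
    sims = {c: x @ p / (np.linalg.norm(x) * np.linalg.norm(p) + 1e-12)
            for c, p in prototypes.items()}
    return max(sims, key=sims.get)
```

Note that the noise is added once per class rather than at every optimization step as in DP-SGD, which is why the guarantee holds even with very few private examples per class.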