Storage-efficient privacy-preserving learning is crucial because modern learning tasks require growing amounts of sensitive user data. We propose a framework that reduces the storage cost of user data while providing privacy guarantees, without essential loss in the utility of the data for learning. Our method comprises noise injection followed by lossy compression. We show that, when the lossy compression is appropriately matched to the distribution of the added noise, the distribution of the compressed examples converges to that of the noise-free training data as the sample size (or the dimension) of the training data increases. In this sense, the utility of the data for learning is essentially maintained, while storage and privacy leakage are reduced by quantifiable amounts. We present experimental results on the CelebA dataset for gender classification and find that the proposed pipeline delivers in practice on the promise of the theory: the individuals in the images are unrecognizable (or less recognizable, depending on the noise level) and overall storage of the data is substantially reduced, with no essential loss (and in some cases a slight boost) in classification accuracy. As an added bonus, our experiments suggest that the method substantially improves robustness to adversarial test data.
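The two-stage pipeline described above (noise injection, then lossy compression) can be illustrated with a minimal sketch. This is not the authors' implementation: the Gaussian noise model, the uniform scalar quantizer used as a stand-in for the lossy compressor, the rule tying the quantizer step to the noise standard deviation, and all function and parameter names (`add_gaussian_noise`, `uniform_quantize`, `step_factor`, and so on) are assumptions made purely for illustration. The abstract does not specify how the compressor is matched to the noise distribution.

```python
import random


def add_gaussian_noise(x, sigma, rng):
    """Return a copy of x with i.i.d. N(0, sigma^2) noise added to each entry."""
    return [v + rng.gauss(0.0, sigma) for v in x]


def uniform_quantize(x, step):
    """Stand-in lossy compressor: map each value to the index of its nearest bin."""
    return [round(v / step) for v in x]


def dequantize(indices, step):
    """Reconstruct approximate values from the stored bin indices."""
    return [i * step for i in indices]


def noise_then_compress(x, sigma, step_factor=1.0, seed=0):
    """Inject noise, then quantize.

    Assumption (not from the source): the quantizer step is simply
    step_factor * sigma, a crude way of tying the compression
    distortion to the noise level.
    """
    rng = random.Random(seed)
    step = step_factor * sigma
    noisy = add_gaussian_noise(x, sigma, rng)
    return uniform_quantize(noisy, step), step


# Hypothetical example: four pixel intensities in [0, 1].
pixels = [0.10, 0.52, 0.49, 0.93]
indices, step = noise_then_compress(pixels, sigma=0.1)
reconstruction = dequantize(indices, step)
```

Only the integer `indices` and the scalar `step` would need to be stored. A larger `sigma` (and therefore a coarser step) means fewer distinct indices to encode and a reconstruction that reveals less about each original value, which is the storage and privacy trade-off the abstract refers to.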