Arbitrary Decisions are a Hidden Cost of Differentially-Private Training

Mechanisms used in privacy-preserving machine learning often aim to guarantee differential privacy (DP) during model training. Practical DP-ensuring training methods use randomization when fitting model parameters to privacy-sensitive data (e.g., adding Gaussian noise to clipped gradients). We demonstrate that such randomization incurs predictive multiplicity: for a given input example, the output predicted by equally-private models depends on the randomness used in training. Thus, for a given input, the predicted output can vary drastically if a model is re-trained, even if the same training dataset is used. The predictive-multiplicity cost of DP training has not been studied, and is currently neither audited for nor communicated to model designers and stakeholders. We derive a bound on the number of re-trainings required to estimate predictive multiplicity reliably. We analyze -- both theoretically and through extensive experiments -- the predictive-multiplicity cost of three DP-ensuring algorithms: output perturbation, objective perturbation, and DP-SGD. We demonstrate that the degree of predictive multiplicity rises as the level of privacy increases, and is unevenly distributed across individuals and demographic groups in the data. Because randomness used to ensure DP during training explains predictions for some examples, our results highlight a fundamental challenge to the justifiability of decisions supported by differentially-private models in high-stakes settings. We conclude that practitioners should audit the predictive multiplicity of their DP-ensuring algorithms before deploying them in applications of individual-level consequence.
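
The kind of audit recommended above can be illustrated with a minimal sketch, which is not the authors' implementation. It assumes a small logistic-regression model, applies output perturbation by adding Gaussian noise to the fitted weights, repeats this many times, and reports for each input how often two independent re-trainings would disagree. The function names, the `noise_scale` parameter, and the disagreement measure `2p(1-p)` are illustrative assumptions. In particular, `noise_scale` is not calibrated to any privacy budget here.

```python
import numpy as np


def fit_logreg(X, y, lam=0.1, steps=300, lr=0.5):
    """Non-private L2-regularized logistic regression via gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))
        grad = X.T @ (p - y) / len(y) + lam * w
        w -= lr * grad
    return w


def disagreement_rates(X, y, X_eval, noise_scale, n_retrain=200, seed=0):
    """Estimate per-example disagreement across noisy re-trainings.

    Assumption: each "re-training" is the same deterministic fit plus fresh
    Gaussian noise on the weights (output perturbation). noise_scale is a
    hypothetical knob, NOT derived from an (epsilon, delta) guarantee.
    """
    rng = np.random.default_rng(seed)
    w = fit_logreg(X, y)
    preds = np.empty((n_retrain, X_eval.shape[0]))
    for i in range(n_retrain):
        w_noisy = w + rng.normal(0.0, noise_scale, size=w.shape)
        preds[i] = (X_eval @ w_noisy > 0).astype(float)
    # p1 = fraction of re-trained models predicting class 1 for each example.
    p1 = preds.mean(axis=0)
    # Probability that two independently re-trained models disagree (in [0, 0.5]).
    return 2.0 * p1 * (1.0 - p1)
```

Calling `disagreement_rates` with a larger `noise_scale` (loosely, a stronger privacy requirement) should raise the rates for examples near the decision boundary, while `noise_scale=0.0` yields zero disagreement everywhere. Comparing the rates across demographic groups in `X_eval` is one way to check whether the cost is unevenly distributed.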

