Data stewards and analysts can promote transparent and trustworthy science and policy-making by facilitating assessments of the sensitivity of published results to alternative analysis choices. For example, researchers may want to assess whether the results change substantially when different subsets of data points (e.g., sets formed by demographic characteristics) are used in the analysis, or when different models (e.g., with or without log transformations) are estimated on the data. However, releasing the results of such stability analyses leaks information about the data subjects. When the underlying data are confidential, data stewards and analysts may seek to bound this information leakage. We present methods for stability analyses that can satisfy differential privacy, a definition of data confidentiality that provides such bounds. We use regression modeling as the motivating example. The basic idea is to split the data into disjoint subsets, compute on each subset a measure summarizing the difference between the published and alternative analyses, aggregate these subset estimates, and add noise to the aggregated value to satisfy differential privacy. We illustrate the methods using regressions in which an analyst compares coefficient estimates for different groups in the data, and regressions in which an analyst fits two different models to the data.
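The split, compute, aggregate, and add-noise idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`ols_slope`, `dp_slope_difference`), the parameters `n_subsets` and `clip`, the use of the mean as the aggregate, and the choice of the Laplace mechanism are all assumptions made here for concreteness. The per-subset measure is the difference in fitted slopes between two groups, clipped to a fixed range so that changing one record moves the average by at most `2 * clip / n_subsets`.

```python
import numpy as np


def ols_slope(x, y):
    """Slope of a simple least-squares regression of y on x (with intercept)."""
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]


def dp_slope_difference(x, y, group, n_subsets, epsilon, clip, rng):
    """Noisy average of per-subset slope differences (group 1 minus group 0).

    Hypothetical sketch: records are randomly partitioned into disjoint
    subsets, so each record influences exactly one subset estimate.
    """
    idx = rng.permutation(len(y))          # data-independent random partition
    parts = np.array_split(idx, n_subsets)

    diffs = []
    for part in parts:
        g, xs, ys = group[part], x[part], y[part]
        in0, in1 = g == 0, g == 1
        if in0.sum() < 3 or in1.sum() < 3:
            # Too few points to fit both groups: fall back to a fixed default
            # inside the clipping range (an assumption of this sketch).
            diff = 0.0
        else:
            diff = ols_slope(xs[in1], ys[in1]) - ols_slope(xs[in0], ys[in0])
        # Clipping bounds each subset's contribution to [-clip, clip].
        diffs.append(np.clip(diff, -clip, clip))

    # One record changes one clipped value, so the mean moves by at most this.
    sensitivity = 2.0 * clip / n_subsets
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return float(np.mean(diffs) + noise)
```

In this sketch, more subsets reduce the noise scale but leave fewer records per regression, and a tighter `clip` reduces noise at the cost of possible bias; both would need to be chosen without looking at the confidential data.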