Demand is growing for privacy-preserving methodology in modern statistics and machine learning. Differential privacy, a mathematical notion from computer science, is an increasingly adopted tool that offers robust privacy guarantees. Recent work focuses primarily on developing differentially private versions of individual statistical and machine learning tasks, and typically does not account for nontrivial upstream pre-processing. An important example is record linkage performed before downstream modeling. Record linkage is the statistical task of linking two or more data sets that cover the same group of entities but share no unique identifier. Because this procedure is probabilistic, it introduces additional uncertainty into the subsequent task. In this paper, we present two differentially private algorithms for linear regression with linked data: a noisy gradient method and a sufficient statistics perturbation approach, both for estimating the regression coefficients. We investigate the privacy-accuracy tradeoff by providing finite-sample error bounds for the estimators, which allows us to understand the relative contributions of linkage error, estimation error, and the cost of privacy. The variances of the estimators are also discussed. We demonstrate the performance of the proposed algorithms through simulations and an application to synthetic data.
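For readers unfamiliar with the privacy notion invoked above, the standard definition of approximate differential privacy can be stated as follows. This is the textbook formulation, given here as background only: the abstract does not say which variant of differential privacy the paper adopts, and the symbols $M$, $D$, $D'$, $S$, $\varepsilon$, and $\delta$ are notation assumed for this illustration rather than taken from the paper.

```latex
% Textbook (\varepsilon, \delta)-differential privacy; notation is assumed, not the paper's.
% M       : a randomized mechanism (e.g., a private regression estimator)
% D, D'   : neighboring data sets, i.e., data sets differing in one record
% S       : any measurable set of outputs
\[
  \Pr\bigl[ M(D) \in S \bigr]
  \;\le\;
  e^{\varepsilon} \, \Pr\bigl[ M(D') \in S \bigr] + \delta
  \qquad \text{for all neighboring } D, D' \text{ and all measurable } S .
\]
```

Smaller values of $\varepsilon$ and $\delta$ correspond to stronger privacy and generally require more injected noise, which is the source of the privacy-accuracy tradeoff studied here.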