In the social sciences, small- to medium-scale datasets are common, and linear regression is canonical. In privacy-aware settings, much work has focused on differentially private (DP) linear regression, but mostly on point estimation, with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP linear regression methods do not readily support it. Mainstream DP-SDG approaches are either tailored to discrete or discretized data, making them less suitable for analyses involving continuous variables, or reliant on deep learning models that require large datasets, limiting their use for the smaller-scale data typical in social science. We propose a method for linear regression with valid inference under Gaussian DP. It includes a bias-corrected estimator with asymptotic confidence intervals (CIs) and a general SDG procedure such that the corresponding regression on the synthetic data matches our DP linear regression procedure. Our approach is effective in small- to moderate-dimensional settings. Experiments show that our method (1) improves accuracy over existing methods for DP linear regression, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream statistical and machine learning tasks than current DP synthesizers.