In the social sciences, small- to medium-scale datasets are common and linear regression (LR) is canonical. In privacy-aware settings, much work has focused on differentially private (DP) LR, but mostly on point estimation, with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP LR methods do not readily support it. Mainstream SDG approaches are either tailored to discretized data, making them less suitable for continuous regression, or rely on deep models that require large datasets, limiting their use for the smaller, continuous data typical in social science. We propose a method for LR with valid inference under Gaussian DP: a DP bias-corrected estimator with asymptotic confidence intervals (CIs), and a general SDG procedure in which regression on the synthetic data matches our DP regression. Our binning-aggregation strategy is effective in small- to moderate-dimensional settings. Experiments show that our method (1) improves accuracy over existing methods, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream ML tasks than current DP SDG methods.