In the social sciences, small- to medium-scale datasets are common and linear regression (LR) is canonical. In privacy-aware settings, much work has focused on differentially private (DP) LR, but mostly on point estimation, with limited attention to uncertainty quantification. Meanwhile, synthetic data generation (SDG) is increasingly important for reproducibility studies, yet current DP LR methods do not readily support it. Mainstream SDG approaches either are tailored to discretized data, making them less suitable for continuous regression, or rely on deep models that require large datasets, limiting their use for the smaller, continuous data typical of social science. We propose a method for LR with valid inference under Gaussian DP: a DP bias-corrected estimator with asymptotic confidence intervals (CIs), and a general SDG procedure in which regression on the synthetic data matches our DP regression. Our binning-aggregation strategy is effective in small- to moderate-dimensional settings. Experiments show our method (1) improves accuracy over existing methods, (2) provides valid CIs, and (3) produces more reliable synthetic data for downstream ML tasks than current DP SDGs.
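The abstract does not spell out the DP estimator, so the following is only an illustrative sketch of the standard approach it builds on: perturbing the sufficient statistics of least squares with the Gaussian mechanism after per-record clipping. The function name, the clipping scheme, and the noise calibration are assumptions for illustration, not the paper's actual method (which additionally applies a bias correction and derives asymptotic CIs).

```python
import numpy as np

def dp_linear_regression(X, y, noise_std, clip=1.0, seed=None):
    """Illustrative Gaussian-mechanism LR via sufficient-statistics
    perturbation. `noise_std` would be calibrated externally to the
    target Gaussian-DP level; this sketch takes it as given."""
    rng = np.random.default_rng(seed)
    # Clip each record (features and response jointly) to bound
    # the per-record sensitivity of the sufficient statistics.
    norms = np.linalg.norm(np.hstack([X, y[:, None]]), axis=1, keepdims=True)
    scale = np.minimum(1.0, clip / norms)
    Xc, yc = X * scale, y * scale.ravel()
    d = X.shape[1]
    # Gaussian mechanism on X^T X (symmetrized noise) and X^T y.
    noise_xx = rng.normal(0.0, noise_std, (d, d))
    noise_xx = (noise_xx + noise_xx.T) / 2
    xtx = Xc.T @ Xc + noise_xx
    xty = Xc.T @ yc + rng.normal(0.0, noise_std, d)
    # Small ridge term keeps the noisy Gram matrix invertible.
    return np.linalg.solve(xtx + 1e-6 * np.eye(d), xty)
```

With `noise_std=0` and a loose clipping bound, the sketch reduces to ordinary least squares, which makes the privacy-utility trade-off explicit: all deviation from OLS comes from clipping and injected noise.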