A common approach to synthetic data is to sample from a fitted model. We show that, under general assumptions, this approach results in a sample whose estimators are inefficient and whose joint distribution is inconsistent with the true distribution. Motivated by this, we propose a general method of producing synthetic data that is widely applicable to parametric models, has asymptotically efficient summary statistics, and is both easily implemented and highly computationally efficient. Our approach allows for the construction of both partially synthetic datasets, which preserve certain summary statistics, and fully synthetic datasets, which satisfy the strong guarantee of differential privacy (DP), with the same asymptotic guarantees in both cases. We also provide theoretical and empirical evidence that the distribution of the data produced by our procedure converges to the true distribution. Beyond synthetic data, our procedure can also be used to perform approximate hypothesis tests in the presence of intractable likelihood functions.
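The inefficiency of sampling from a fitted model can be seen in a toy simulation. The sketch below is an illustration only, not the procedure proposed here: it assumes a Gaussian location model N(mu, 1) with known unit variance, and the function name `simulate` and its parameters are hypothetical. In this model the sample mean of the original data has variance 1/n, while the sample mean of synthetic data drawn from the fitted model N(mu_hat, 1) has variance 2/n, because the sampling noise is added on top of the estimation noise.

```python
import numpy as np


def simulate(n=200, reps=5000, mu=1.0, seed=0):
    """Compare n * Var(estimator) on original vs. naively synthesized data.

    Toy assumption: data are N(mu, 1) and the estimator is the sample mean.
    """
    rng = np.random.default_rng(seed)

    # `reps` independent original datasets, each of size n.
    x = rng.normal(mu, 1.0, size=(reps, n))
    mu_hat = x.mean(axis=1)  # fitted parameter for each dataset

    # Naive synthetic data: sample n points from the fitted model N(mu_hat, 1).
    z = rng.normal(mu_hat[:, None], 1.0, size=(reps, n))
    mu_syn = z.mean(axis=1)  # same estimator, computed on synthetic data

    # Scale by n so the efficient value is about 1.
    return n * mu_hat.var(), n * mu_syn.var()


orig_var, syn_var = simulate()
print(f"n * Var(mean), original data:  {orig_var:.3f}")
print(f"n * Var(mean), synthetic data: {syn_var:.3f}")
```

The two printed values should be close to 1 and 2 respectively, so in this toy setting the estimator computed on the synthetic sample needs about twice as many observations to match the precision of the original one.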