When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. We apply the proposed methods to a subset of data from the American Community Survey.
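The four steps of the pipeline described above (pseudo-population, simple random sample, synthesis model, released draws) can be illustrated with a minimal sketch. It is not the implementation used in this work, and it rests on several assumptions. The function names `weighted_polya_pseudo_population` and `synthesize` are hypothetical. The pseudo-population step uses only the weighted Pólya urn draw commonly associated with the weighted finite population Bayesian bootstrap, and omits any preliminary Bayesian bootstrap resampling of the original sample. The weights are assumed to be at least 1 and to sum to roughly the population size. The synthesis model is a plug-in multivariate normal chosen purely for illustration, so the data are assumed to have at least two continuous columns; a proper synthesizer would typically also draw the model parameters from their posterior distribution.

```python
import numpy as np


def weighted_polya_pseudo_population(weights, rng):
    """Return sample indices forming a pseudo-population of size N = round(sum(weights)).

    Simplified sketch: each sampled unit appears once, and the remaining
    N - n units are drawn by a weighted Polya urn scheme.
    """
    w = np.asarray(weights, dtype=float)
    n = w.size
    N = int(round(w.sum()))
    w = w * N / w.sum()              # rescale so the weights sum exactly to N
    extra = N - n                    # number of unobserved units to fill in
    ratio = extra / n
    counts = np.zeros(n)             # times each unit has been drawn so far
    drawn = np.empty(extra, dtype=int)
    for k in range(extra):
        # Urn probabilities: proportional to (w_i - 1) plus a reinforcement term.
        probs = np.clip(w - 1.0, 0.0, None) + counts * ratio
        probs = probs / probs.sum()
        i = rng.choice(n, p=probs)
        counts[i] += 1
        drawn[k] = i
    return np.concatenate([np.arange(n), drawn])


def synthesize(data, weights, n_srs, n_synth, seed=0):
    """Run one pass: pseudo-population -> SRS -> fitted model -> synthetic draws."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    # Step 1: build a pseudo-population that reflects the survey weights.
    idx = weighted_polya_pseudo_population(weights, rng)
    pseudo_pop = data[idx]

    # Step 2: take a simple random sample without replacement.
    rows = rng.choice(len(pseudo_pop), size=n_srs, replace=False)
    srs = pseudo_pop[rows]

    # Step 3: estimate the (illustrative, plug-in) synthesis model.
    mean = srs.mean(axis=0)
    cov = np.cov(srs, rowvar=False)

    # Step 4: simulate the data that would be released.
    return rng.multivariate_normal(mean, cov, size=n_synth)
```

Calling `synthesize` repeatedly with different seeds corresponds to the second strategy above, with one synthetic data set per simple random sample. Drawing several data sets from the same fitted model would correspond to the first.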