When seeking to release public use files for confidential data, statistical agencies can generate fully synthetic data. We propose an approach for making fully synthetic data from surveys collected with complex sampling designs. Our approach adheres to the general strategy proposed by Rubin (1993). Specifically, we generate pseudo-populations by applying the weighted finite population Bayesian bootstrap to account for survey weights, take simple random samples from those pseudo-populations, estimate synthesis models using these simple random samples, and release simulated data drawn from the models as public use files. To facilitate variance estimation, we use the framework of multiple imputation with two data generation strategies. In the first, we generate multiple data sets from each simple random sample. In the second, we generate a single synthetic data set from each simple random sample. We present multiple imputation combining rules for each setting. We illustrate the repeated sampling properties of the combining rules via simulation studies, including comparisons with synthetic data generation based on pseudo-likelihood methods. We apply the proposed methods to a subset of data from the American Community Survey.
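The pipeline described above (pseudo-population, simple random sample, synthesis model, simulated release) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names `polya_pseudo_population` and `synthesize` are hypothetical. The weights are assumed to be at least 1 and to sum to the population size. Only a weighted Pólya urn step stands in for the weighted finite population Bayesian bootstrap, and the full procedure may include further steps omitted here. A plug-in normal model is used as a placeholder synthesis model, and the multiple imputation combining rules are not shown.

```python
import random
import statistics


def polya_pseudo_population(values, weights, rng):
    """Expand n sampled units into a pseudo-population of size N = sum(weights).

    Assumption: every weight is >= 1 and the weights sum to (about) N.
    """
    n = len(values)
    big_n = round(sum(weights))
    extra = big_n - n              # number of non-sampled units to fill in
    ratio = extra / n
    counts = [0] * n               # times each sampled unit has been re-drawn
    for _ in range(extra):
        # Urn weight for unit i: w_i - 1 + (draws of i so far) * (N - n) / n
        urn = [w - 1 + c * ratio for w, c in zip(weights, counts)]
        i = rng.choices(range(n), weights=urn, k=1)[0]
        counts[i] += 1
    population = []
    for v, c in zip(values, counts):
        population.extend([v] * (1 + c))   # original unit plus its copies
    return population


def synthesize(values, weights, n_srs, n_syn, seed=0):
    """One pass: pseudo-population -> SRS -> toy model -> synthetic draws."""
    rng = random.Random(seed)
    population = polya_pseudo_population(values, weights, rng)
    srs = rng.sample(population, n_srs)    # simple random sample, no weights
    # Placeholder synthesis model: normal with plug-in mean and sd.
    mu = statistics.fmean(srs)
    sigma = statistics.stdev(srs)
    return [rng.gauss(mu, sigma) for _ in range(n_syn)]
```

Calling `synthesize` repeatedly with different seeds would loosely correspond to the second strategy (one synthetic data set per simple random sample). Drawing several synthetic data sets from the same fitted model would correspond to the first.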