The two-phase sampling design is a cost-effective sampling strategy that has been widely used in public health research. The conventional approach in this design is to create subsample-specific weights that adjust for the probabilities of selection and response in the second phase. However, these weights can be highly variable, which in turn makes weighted analyses unstable. Alternatively, the rich data collected in the first phase of the study can be used to improve survey inference for the second-phase sample. In this paper, we use a Bayesian tree-based multiple imputation (MI) approach to estimate population means under a two-phase survey design. We demonstrate how to incorporate complex survey design features, such as strata, clusters, and weights, into the imputation procedure. We use a simulation study to compare the performance of the tree-based MI approach with that of weighted analyses using the subsample weights. We find that the tree-based MI method outperforms the weighting methods, with smaller bias, reduced root mean squared error, and narrower 95\% confidence intervals whose coverage rates are closer to the nominal level. We illustrate the proposed method by estimating the prevalence of diabetes among the non-institutionalized adult population of the United States, using fasting blood glucose data that were collected on only a subsample of participants in the 2017-2018 National Health and Nutrition Examination Survey.
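To make the instability of highly variable weights concrete, the following minimal sketch computes Kish's approximate effective sample size, $n_{\text{eff}} = (\sum_i w_i)^2 / \sum_i w_i^2$, for two sets of weights. This is a standard diagnostic and is not necessarily one used in the paper; the function name and the example weights are hypothetical and chosen only for illustration.

```python
def kish_effective_sample_size(weights):
    """Kish's approximation: (sum of weights)^2 / (sum of squared weights)."""
    total = sum(weights)
    total_sq = sum(w * w for w in weights)
    return total * total / total_sq


# Hypothetical weights for a subsample of 100 respondents.
equal_weights = [1.0] * 100                   # no variability in the weights
variable_weights = [1.0] * 90 + [20.0] * 10   # a few very large weights

n_eff_equal = kish_effective_sample_size(equal_weights)
n_eff_variable = kish_effective_sample_size(variable_weights)

print(f"equal weights:    n_eff = {n_eff_equal:.1f}")
print(f"variable weights: n_eff = {n_eff_variable:.1f}")
```

With equal weights the effective sample size equals the nominal size of 100. With the variable weights it falls to roughly 21, so under this approximation a weighted mean would be about as precise as one from a simple random sample a fifth of the size.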