Statistical agencies are increasingly exploring non-probability data sources to produce more timely and detailed statistics while reducing costs and respondent burden. Coverage error and measurement error are two issues that may affect such data. These imperfections may be corrected using available information about the population of interest, such as a census or a reference probability sample. In this paper, we compare a wide range of existing methods for producing population estimates from a non-probability dataset, using a simulation study based on a realistic business population. The study examines the performance of the methods under different missingness and data quality assumptions. The results confirm the ability of the methods examined to address selection bias. When no measurement error is present in the non-probability dataset, a screening dual-frame approach for the probability sample tends to yield a smaller sample size and a lower mean squared error. The presence of measurement error and/or nonignorable missingness increases the mean squared error of estimators that depend heavily on the non-probability data. In that case, the best approach tends to be falling back on a model-assisted estimator based on the probability sample.
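To make the screening dual-frame idea concrete, the following is a minimal sketch and not the implementation used in the study. It assumes that units in the non-probability dataset are measured without error and are screened out of the probability sample's frame, so the probability sample only has to represent the remaining units. The function name `screening_dual_frame_total`, its parameters, and the numbers in the usage example are hypothetical and chosen only for illustration.

```python
def screening_dual_frame_total(y_nonprob, y_prob, weights_prob):
    """Sketch of a screening dual-frame estimate of a population total.

    Assumptions (illustrative, not taken from the paper):
      - y_nonprob: values observed for every unit in the non-probability
        dataset, treated as error-free and counted with weight 1.
      - y_prob: values from a probability sample drawn only from units
        NOT covered by the non-probability dataset.
      - weights_prob: design weights of that probability sample, which
        expand it to the uncovered part of the population.
    """
    if len(y_prob) != len(weights_prob):
        raise ValueError("y_prob and weights_prob must have the same length")

    # Covered part of the population: each observed unit represents itself.
    covered_total = sum(y_nonprob)

    # Uncovered part: weighted expansion of the probability sample.
    uncovered_total = sum(w * y for w, y in zip(weights_prob, y_prob))

    return covered_total + uncovered_total


if __name__ == "__main__":
    # Hypothetical numbers: two covered units, and two sampled units that
    # each represent five units of the uncovered remainder.
    estimate = screening_dual_frame_total([10.0, 20.0], [1.0, 3.0], [5.0, 5.0])
    print(estimate)
```

Because the covered units enter with weight 1, any measurement error in the non-probability values passes straight into the estimate. This is consistent with the abstract's remark that estimators relying heavily on the non-probability data degrade when measurement error is present.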