The aim of survey statistics is to produce estimates with minimal bias and acceptably low variance within a given budget, preferably with little response burden for participants. In recent years, considerable effort has gone into achieving this through the extended use of found, or nonprobability, data. However, to use such data safely, rigorous theoretical foundations are needed; a main concern is the lack of control that comes from not having access to the selection mechanism for the data. Several methods have been proposed in the literature to deal with this, though they often rely on assumptions that may be difficult or impossible to verify in practice. Building on the Data Integrated (DI) estimator introduced by Kim and Tam (2021), this paper introduces the Model Assisted Data Integration (MADI) sampling strategy. The proposed sampling strategy includes an estimator that has the desired properties: it is design-unbiased, has a design-unbiased variance estimator, and is suitable for the intense production cycle of a statistical agency. The estimator combines nonprobability data with a probability sample whose sampling design aims to include individuals not captured by the nonprobability data. The estimator can use arbitrary machine learning models and still produce unbiased estimates. A main conclusion of the paper is that the proposed sampling strategy can produce estimates with much lower variances than traditional survey estimators, and we use real empirical data to illustrate this point.
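To make the idea of combining nonprobability data with a probability sample of the uncovered units more concrete, the following is a minimal sketch, not the MADI estimator itself, whose exact form is not given here. It assumes a generic difference-type estimator of a population total, simple random sampling without replacement from the units outside the nonprobability data, and a prediction model fitted only on the nonprobability data (so the model does not depend on the drawn sample). All names (`fit_line`, `integrated_total`) and all numbers are hypothetical and chosen purely for illustration; the least-squares line stands in for an arbitrary machine learning model.

```python
from itertools import combinations


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (stand-in for any ML model)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x


def integrated_total(y_nonprob, x_rest, sample, y_sampled, model):
    """Difference-type estimate of the population total (illustrative only).

    y_nonprob : y-values observed in the nonprobability data (set B)
    x_rest    : auxiliary x for every unit NOT in B (set C)
    sample    : indices into x_rest drawn by simple random sampling
    y_sampled : y-values collected for the sampled units, same order
    model     : prediction function fitted without using the sample
    """
    n_rest, n = len(x_rest), len(sample)
    weight = n_rest / n  # inverse inclusion probability under SRS
    predicted = sum(model(x) for x in x_rest)
    # Weighted sample residuals correct the model predictions for set C.
    correction = weight * sum(
        y - model(x_rest[i]) for i, y in zip(sample, y_sampled)
    )
    return sum(y_nonprob) + predicted + correction


# Toy population (made-up numbers, purely illustrative).
x_b, y_b = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.2, 8.1]
x_c, y_c = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5], [1.4, 2.7, 5.5, 6.6, 9.8, 10.1]
model = fit_line(x_b, y_b)  # fitted on the nonprobability data only

# Average the estimator over every possible sample of size 2 from set C.
estimates = [
    integrated_total(y_b, x_c, s, [y_c[i] for i in s], model)
    for s in combinations(range(len(x_c)), 2)
]
design_mean = sum(estimates) / len(estimates)
true_total = sum(y_b) + sum(y_c)
```

In this sketch, averaging the estimate over all possible samples recovers the true total regardless of how well the model fits, which is what design-unbiasedness means; a better model only shrinks the residuals and hence the variance.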