Doubly robust integration of nonprobability and probability survey data

From arXiv; 66 pages, 31 figures. Version 2 of the preprint extends the paper with domain estimation; a new Hajek-style version of the Kim-Haziza doubly robust estimator; and theory on the asymptotic relative efficiency of the combined estimators, together with a simulation study to assess that relative efficiency.

Doubly robust estimators of the population mean (or prevalence) of an outcome have been proposed for integrating outcome and covariate data from a nonprobability sample with covariate data from a probability survey. These estimators combine inverse probability weighting with mass imputation. However, the question of how to combine these doubly robust estimators with a Horvitz-Thompson or Hajek estimator that uses only outcome data from the probability survey has received limited attention. In this paper, we first review previously proposed doubly robust estimators that use outcome data from only the nonprobability sample. We extend these estimators to enable estimation of domain (subpopulation) means (or prevalences), possibly using data from individuals outside the domain to improve estimation when the domain is small. We then consider how to combine such a doubly robust estimator with a Horvitz-Thompson or Hajek estimator that uses only the probability survey data. We describe efficient combined estimators and provide formulae for their repeated-sampling variances and for estimators of these variances. We also investigate the asymptotic relative efficiencies of the combined estimators compared to their two component estimators, and carry out a simulation study to assess their relative efficiencies in finite samples. These relative efficiencies depend on the ratio of the variances of the two component estimators and on how predictive the covariates are of the outcome.
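
To make the two ingredients of the abstract concrete, here is a minimal Python sketch. It is not taken from the paper. It assumes a commonly used weight-normalised doubly robust form (an inverse-probability-weighted mean of residuals from the nonprobability sample plus a design-weighted mean of predictions from the probability survey) and the standard minimum-variance linear combination of two estimators. All function and argument names are hypothetical, and the participation propensities and outcome predictions are assumed to have been fitted elsewhere; the paper's own estimators and variance formulae may differ.

```python
import numpy as np


def hajek_mean(y, d):
    """Hajek (weight-normalised) mean of y under design weights d."""
    return float(np.average(np.asarray(y, float), weights=np.asarray(d, float)))


def doubly_robust_mean(y_np, m_np, pi_np, m_p, d_p):
    """Weight-normalised doubly robust mean (an assumed, illustrative form).

    y_np  : outcomes in the nonprobability sample
    m_np  : outcome-model predictions for the nonprobability sample
    pi_np : estimated participation propensities for the nonprobability sample
    m_p   : outcome-model predictions for the probability survey
    d_p   : design weights of the probability survey
    """
    y_np, m_np, pi_np = (np.asarray(a, float) for a in (y_np, m_np, pi_np))
    ipw = 1.0 / pi_np
    # Inverse probability weighting applied to the model residuals.
    residual_term = np.sum(ipw * (y_np - m_np)) / np.sum(ipw)
    # Mass imputation: design-weighted mean of predicted outcomes.
    imputation_term = hajek_mean(m_p, d_p)
    return float(residual_term + imputation_term)


def combine(est_a, var_a, est_b, var_b, cov_ab=0.0):
    """Minimum-variance linear combination lam*est_a + (1 - lam)*est_b.

    cov_ab defaults to 0, which is a simplifying assumption: in practice the
    two estimators share the probability survey and may be correlated.
    """
    lam = (var_b - cov_ab) / (var_a + var_b - 2.0 * cov_ab)
    est = lam * est_a + (1.0 - lam) * est_b
    var = lam**2 * var_a + (1.0 - lam) ** 2 * var_b + 2.0 * lam * (1.0 - lam) * cov_ab
    return est, var, lam
```

With zero covariance, the combined variance reduces to var_a * var_b / (var_a + var_b), which is never larger than either component variance. This is consistent with the abstract's remark that the gain from combining depends on the ratio of the two component variances.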
