We propose a novel cumulative distribution function (CDF) estimator that integrates data from probability samples with data from potentially large nonprobability samples. Assuming that a set of shared covariates is observed in both samples, while the response variable is observed only in the nonprobability sample, the proposed estimator uses a survey-weighted empirical CDF of residuals from a regression model fitted on the convenience sample to estimate the CDF of the response variable. Under some assumptions, we derive the asymptotic bias and variance of our CDF estimator and show that it is asymptotically unbiased for the finite population CDF if ignorability holds. Empirical results demonstrate that the estimator performs well under model misspecification when ignorability holds, and under nonignorable sampling when the outcome model is correctly specified. Even when both assumptions fail, the residual-based estimator continues to outperform its plug-in and naïve counterparts, albeit with reduced efficiency.
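One possible reading of the construction described above can be sketched in Python. This is an illustrative sketch only, not the paper's exact estimator: it assumes a linear working model fitted by least squares on the nonprobability sample and a Hajek-type normalization by the sum of the survey weights. The function name `residual_cdf_estimate` and all argument names are hypothetical.

```python
import numpy as np


def residual_cdf_estimate(t, x_prob, w_prob, x_np, y_np):
    """Sketch of a residual-based CDF estimate at threshold(s) t.

    x_prob, w_prob: covariates and survey weights from the probability sample.
    x_np, y_np: covariates and responses from the nonprobability sample.
    """
    t = np.atleast_1d(np.asarray(t, dtype=float))
    w_prob = np.asarray(w_prob, dtype=float)
    y_np = np.asarray(y_np, dtype=float)

    # Assumption: a linear working model, fitted on the nonprobability sample.
    design_np = np.column_stack([np.ones(len(y_np)), x_np])
    beta, *_ = np.linalg.lstsq(design_np, y_np, rcond=None)
    residuals = np.sort(y_np - design_np @ beta)

    # Predict the unobserved response for each probability-sample unit.
    design_prob = np.column_stack([np.ones(len(w_prob)), x_prob])
    fitted = design_prob @ beta

    # Empirical CDF of the residuals, evaluated at t - fitted_i for each unit i.
    # searchsorted with side="right" counts residuals <= the query value.
    gaps = t[:, None] - fitted[None, :]
    resid_cdf = np.searchsorted(residuals, gaps, side="right") / len(residuals)

    # Assumption: Hajek-type survey-weighted average over the probability sample.
    return resid_cdf @ w_prob / w_prob.sum()
```

The estimate at each threshold is a weighted average of residual CDF values, so it stays in the unit interval and is nondecreasing in the threshold. Swapping in a different outcome model only changes how `fitted` and `residuals` are computed.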