In this paper we study predictive mean matching (PMM) mass imputation estimators for integrating data from probability and non-probability samples. We consider two approaches: matching predicted to predicted values ($\hat{y}-\hat{y}$~matching; PMM A) and matching predicted to observed values ($\hat{y}-y$~matching; PMM B). We prove the consistency of two semi-parametric mass imputation estimators based on these approaches and derive their variances and the corresponding variance estimators. We highlight the differences between our approach and the nearest neighbour approach proposed by Yang et al. (2021) and prove the consistency of the PMM A estimator under model mis-specification. Our approach can be employed with non-parametric regression techniques, such as kernel regression, and the analytical expression for the variance can also be applied to nearest neighbour matching for non-probability samples. We conduct extensive simulation studies to compare the properties of the proposed estimators with existing approaches, discuss the selection of the number of nearest neighbours $k$, and study the effects of model mis-specification. The paper concludes with an empirical study that integrates a job vacancy survey with vacancies submitted to public employment offices (administrative and online data). Open source software is available for the proposed approaches.
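To make the distinction between the two matching schemes concrete, the following is a minimal Python sketch of PMM A and PMM B mass imputation. It is not the open source software mentioned above, and several choices are assumptions made only for illustration: the function name `pmm_mass_imputation` and its arguments are hypothetical, the outcome model is an ordinary least squares fit on the non-probability sample, donors are chosen by absolute distance, the imputed value is the mean of the $k$ donors' observed outcomes, and the population mean is estimated with a weighted mean normalised by the sum of the design weights.

```python
import numpy as np


def pmm_mass_imputation(x_np, y_np, x_p, d_p, k=1, variant="A"):
    """Illustrative PMM mass imputation estimate of a population mean.

    x_np, y_np : covariates and observed outcomes in the non-probability sample
    x_p, d_p   : covariates and design weights in the probability sample
    k          : number of nearest-neighbour donors
    variant    : "A" matches y-hat to y-hat, "B" matches y-hat to observed y
    """
    if variant not in ("A", "B"):
        raise ValueError("variant must be 'A' or 'B'")

    y_np = np.asarray(y_np, dtype=float)
    d_p = np.asarray(d_p, dtype=float)

    # Assumption: a linear outcome model with intercept, fitted by least
    # squares on the non-probability sample only.
    X_np = np.column_stack([np.ones(len(y_np)), x_np])
    X_p = np.column_stack([np.ones(len(d_p)), x_p])
    beta, *_ = np.linalg.lstsq(X_np, y_np, rcond=None)

    yhat_np = X_np @ beta  # predictions for potential donors
    yhat_p = X_p @ beta    # predictions for units to be imputed

    # PMM A compares predictions with donor predictions;
    # PMM B compares predictions with donor observed values.
    target = yhat_np if variant == "A" else y_np

    # Absolute distance between each recipient and every donor.
    dist = np.abs(yhat_p[:, None] - target[None, :])
    donors = np.argsort(dist, axis=1)[:, :k]

    # Impute the mean of the k donors' observed outcomes.
    y_imp = y_np[donors].mean(axis=1)

    # Assumption: weighted mean normalised by the sum of design weights.
    return float(np.sum(d_p * y_imp) / np.sum(d_p))
```

In both variants the imputed values are observed outcomes from the non-probability sample; the variants differ only in what the prediction for a probability-sample unit is compared against when selecting donors.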