In the age of big data, nonprobability surveys are becoming increasingly abundant. Data integration techniques that combine probability and nonprobability surveys are widely used to improve finite population estimation. While much of the existing research has focused on mitigating selection bias in nonprobability surveys, measurement error in these surveys remains relatively unexplored. Statistical methods designed to reduce selection bias yield reliable estimates only if the survey responses are accurate. Motivated by a recent case study of Kennedy, Mercer, and Lau (2024), our research addresses bias from both measurement and sampling errors in nonprobability surveys. In this article, we propose a new data integration method that uses multiple probability and nonprobability surveys and leverages machine learning models to construct a composite estimator. The proposed composite estimator integrates probability and nonprobability surveys when both contain the response variables of interest. We compare the performance of this estimator with that of an existing composite estimator from the literature, both analytically and empirically, using data from multiple surveys in Kennedy et al. (2024). Finally, we identify conditions under which the proposed estimator outperforms estimators based solely on probability surveys.