Multiple heterogeneous data sources are becoming increasingly available for statistical analysis in the era of big data. As an important example in finite-population inference, we develop a unified test-and-pool framework for general parameter estimation that combines a gold-standard probability sample with a non-probability sample. We focus on the case in which the study variable is observed in both datasets for estimating the target parameters, and each dataset also contains auxiliary variables. Utilizing the probability design, we conduct a pretest to assess whether the non-probability data are comparable with the probability data and to decide whether to leverage the non-probability data in a pooled analysis. When the two datasets are comparable, our approach combines them for efficient estimation; otherwise, we retain only the probability data. We also characterize the asymptotic distribution of the proposed test-and-pool estimator under a local alternative and provide a data-adaptive procedure for selecting the critical tuning parameters, targeting the smallest mean squared error of the test-and-pool estimator. Lastly, to deal with the non-regularity of the test-and-pool estimator, we construct a robust confidence interval with good finite-sample coverage.