The use of big data in official statistics and the applied sciences is accelerating, but statistics computed using only big data often suffer from substantial selection bias, which leads to inaccurate estimation and invalid statistical inference. We rectify the issue for a broad class of linear and nonlinear statistics by producing estimating equations that combine big data with a probability sample. Under weak assumptions about an unknown superpopulation, we show that our integrated estimator is consistent, asymptotically unbiased and asymptotically normally distributed. Variance estimators with respect to the sampling design alone, and with respect to the sampling design and the superpopulation jointly, are obtained at once using a single, unified theoretical approach. A surprising corollary is that strategies minimising the design variance almost minimise the joint variance when the population and sample sizes are large. The integrated estimator is shown to be more efficient than its survey-only counterpart if dependence between sample membership indicators is small and the finite population is large. We apply our method to quantiles, the Gini index, linear regression coefficients and maximum likelihood estimators where the sampling design is stratified simple random sampling without replacement. Our results are illustrated in a simulation of individual Australian incomes.
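The selection-bias problem and the idea of combining big data with a probability sample can be sketched in a toy simulation. This is a minimal illustration under stated assumptions, not the estimating-equation estimator developed in the paper: the population, the inclusion mechanism, the function name `simulate` and all parameter values are hypothetical, the population size is assumed known, and big-data membership is assumed to be observable for every sampled unit. The "integrated" estimate here is a simple stand-in that adds the observed big-data total to a design-weighted estimate of the remainder.

```python
import math
import random


def simulate(seed=0, pop_size=20_000, n_per_stratum=400):
    """Compare big-data-only, survey-only and a simple integrated mean estimate."""
    rng = random.Random(seed)

    # Hypothetical finite population: log-normal "incomes" in two strata.
    stratum = [rng.randrange(2) for _ in range(pop_size)]
    y = [rng.lognormvariate(10.0 + 0.5 * h, 0.8) for h in stratum]

    # Big-data inclusion depends on y itself, so the big data set is a
    # non-probability sample that over-represents high incomes.
    in_big = [
        rng.random() < 1.0 / (1.0 + math.exp(-(math.log(v) - 10.5)))
        for v in y
    ]

    # Stratified simple random sampling without replacement;
    # the design weight in stratum h is N_h / n_h.
    weight = {}
    for h in (0, 1):
        members = [i for i in range(pop_size) if stratum[i] == h]
        for i in rng.sample(members, n_per_stratum):
            weight[i] = len(members) / n_per_stratum

    true_mean = sum(y) / pop_size

    # Big-data-only estimate: plain mean of the captured units.
    big_values = [v for v, captured in zip(y, in_big) if captured]
    big_only = sum(big_values) / len(big_values)

    # Survey-only estimate: design-weighted total divided by N.
    survey_only = sum(w * y[i] for i, w in weight.items()) / pop_size

    # Simple integrated estimate (an assumption of this sketch): the big-data
    # total is taken as known, and only the units missing from the big data
    # are estimated from the probability sample.
    remainder = sum(w * y[i] for i, w in weight.items() if not in_big[i])
    integrated = (sum(big_values) + remainder) / pop_size

    return {
        "true_mean": true_mean,
        "big_only": big_only,
        "survey_only": survey_only,
        "integrated": integrated,
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name:12s} {value:12.2f}")
```

In this setup the big-data-only mean overshoots the true mean because inclusion favours large values, while the survey-only and integrated estimates are design-unbiased; how much the integrated estimate gains over the survey alone depends on the assumptions above.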