Subsampling and Jackknifing: A Practically Convenient Solution for Large Data Analysis with Limited Computational Resources

Modern statistical analysis often involves very large datasets. Conventional estimation methods can rarely be applied to such datasets directly, because practitioners typically work with limited computational resources and in most cases have no access to powerful distributed computing systems (e.g., Hadoop or Spark). How to analyze large datasets in practice with limited computational resources is therefore a problem of great importance. To solve this problem, we propose a novel subsampling-based method with jackknifing. The key idea is to treat the whole sample as if it were the population. Multiple subsamples of greatly reduced size are then drawn by simple random sampling with replacement. Notably, we do not recommend sampling without replacement, because it would incur a significant cost for data processing on the hard drive; no such cost arises when the data are processed in memory. Because the subsamples are relatively small, each can be read into computer memory as a whole and processed easily. From each subsample, a jackknife-debiased estimator of the target parameter is computed. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal, and that its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is simple enough to be implemented on most practical computer systems and should therefore be widely applicable.
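
The procedure described above (draw subsamples with replacement, debias each subsample estimate by jackknifing, then average) can be illustrated with a minimal Python sketch. It rests on several assumptions that are not taken from the paper: the data are held in an in-memory NumPy array standing in for a file on the hard drive, the target parameter is a variance estimated by the biased plug-in estimator, and the debiasing step is the textbook leave-one-out jackknife, which may differ in detail from the estimator analyzed in the paper. The names `plugin_variance`, `jackknife_debias`, and `subsample_jackknife` are hypothetical.

```python
import numpy as np


def plugin_variance(x):
    """Biased plug-in variance (divides by n); an illustrative target only."""
    return np.mean((x - x.mean()) ** 2)


def jackknife_debias(x, estimator):
    """Textbook leave-one-out jackknife bias correction (an assumption here,
    not necessarily the exact form used in the paper)."""
    n = len(x)
    full = estimator(x)
    # Recompute the estimator with each observation left out in turn.
    leave_one_out = np.array([estimator(np.delete(x, i)) for i in range(n)])
    return n * full - (n - 1) * leave_one_out.mean()


def subsample_jackknife(data, estimator, n_sub, n_rep, rng):
    """Average jackknife-debiased estimates over n_rep subsamples of size
    n_sub, each drawn from `data` by simple random sampling with replacement."""
    total = len(data)
    estimates = []
    for _ in range(n_rep):
        idx = rng.integers(0, total, size=n_sub)  # indices drawn with replacement
        estimates.append(jackknife_debias(data[idx], estimator))
    return float(np.mean(estimates))


rng = np.random.default_rng(0)
# Simulated "whole sample" with true variance 4; a stand-in for data on disk.
data = rng.normal(loc=0.0, scale=2.0, size=100_000)
final = subsample_jackknife(data, plugin_variance, n_sub=200, n_rep=50, rng=rng)
print(final)
```

For this particular plug-in estimator, the leave-one-out jackknife reproduces the usual unbiased sample variance exactly, which makes the debiasing step easy to check. With data stored on disk, the index draw would be replaced by reading the sampled records from the file; that part is omitted from the sketch.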
