The bootstrap is widely used for statistical inference because of its simplicity and attractive statistical properties. However, the vanilla bootstrap is computationally infeasible for many modern massive datasets because it must repeatedly resample the entire dataset. Several variants developed in recent years address this by first subsampling the full dataset and then resampling within the subsamples to assess the quality of estimators. The performance of these subsampling methods depends on tuning parameters such as the subsample size, the number of subsamples, and the number of resamples per subsample. In this paper, we develop a novel methodology for selecting these hyperparameters. We formulate the selection as an optimization problem that optimizes a measure of an estimator's accuracy subject to a computational cost constraint; the framework yields closed-form solutions for the optimal hyperparameter values for the subsampled bootstrap, the subsampled double bootstrap, and the bag of little bootstraps, at little or no extra time cost. Using mean squared error as a proxy for the accuracy measure, we apply our methodology in a simulation study to examine, compare, and improve the performance of these modern bootstrap variants for massive data. The results are promising.
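To make the three tuning parameters concrete, the following is a minimal sketch of a bag-of-little-bootstraps style estimate of the standard error of a sample mean. It is an illustration under stated assumptions, not the selection methodology of this paper: the function name `blb_standard_error`, its argument names, the choice of the mean as the estimator, and the use of the standard error as the quality measure are all hypothetical choices made for the example.

```python
import numpy as np

def blb_standard_error(data, subsample_size, n_subsamples, n_resamples, seed=0):
    """Illustrative bag-of-little-bootstraps estimate of the standard error
    of the sample mean. All names here are hypothetical, chosen for the sketch."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    uniform = np.full(subsample_size, 1.0 / subsample_size)
    per_subsample_se = []
    for _ in range(n_subsamples):
        # Step 1: draw a small subsample without replacement from the full data.
        subsample = rng.choice(data, size=subsample_size, replace=False)
        estimates = np.empty(n_resamples)
        for j in range(n_resamples):
            # Step 2: resample up to the full size n, represented compactly
            # as multinomial counts over the subsample points.
            counts = rng.multinomial(n, uniform)
            # Weighted mean of the size-n resample.
            estimates[j] = np.dot(counts, subsample) / n
        # Step 3: quality measure within this subsample (spread of the estimates).
        per_subsample_se.append(estimates.std(ddof=1))
    # Step 4: average the quality measure across subsamples.
    return float(np.mean(per_subsample_se))
```

In this sketch the computational cost grows roughly with `n_subsamples * n_resamples * subsample_size`, which is the kind of budget against which the subsample size, the number of subsamples, and the number of resamples per subsample would be traded off.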