Subsampling is a widely used and effective approach for addressing the computational challenges posed by massive datasets. Substantial progress has been made in developing non-uniform, probability-based subsampling schemes that prioritize more informative observations. We propose a novel stratification mechanism that can be combined with existing subsampling designs to further improve estimation efficiency. We establish the estimator's asymptotic normality and quantify the resulting efficiency gains, which enables a principled procedure for selecting stratification variables and interval boundaries that target reductions in asymptotic variance. The resulting algorithm, Maximum-Variance-Reduction Stratification (MVRS), achieves significant improvements in estimation efficiency while incurring only linear additional computational cost. MVRS is applicable to both non-uniform and uniform subsampling methods. Experiments on simulated and real datasets confirm that MVRS markedly reduces estimator variance and improves accuracy compared with existing subsampling methods.
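As a rough illustration of why stratification can reduce estimator variance, the sketch below compares a uniform subsample mean with a stratified subsample mean on simulated data. It is not the MVRS algorithm: the simulated model, the decile boundaries on a single auxiliary variable `x`, the proportional allocation, and all function names are assumptions made for this example, whereas MVRS selects stratification variables and interval boundaries to target reductions in asymptotic variance.

```python
import numpy as np


def uniform_subsample_mean(y, n, rng):
    """Mean of a uniform subsample of size n, drawn without replacement."""
    idx = rng.choice(y.size, size=n, replace=False)
    return y[idx].mean()


def stratified_subsample_mean(y, groups, n, rng):
    """Stratified mean with proportional allocation.

    `groups` is a list of index arrays, one per stratum. Stratum k receives
    about n * N_k / N draws, and stratum means are weighted by N_k / N.
    """
    N = y.size
    estimate = 0.0
    for members in groups:
        N_k = members.size
        n_k = max(1, int(round(n * N_k / N)))
        idx = rng.choice(members, size=n_k, replace=False)
        estimate += (N_k / N) * y[idx].mean()
    return estimate


rng = np.random.default_rng(0)
N, n, K = 100_000, 500, 10

# Simulated full data (assumed model): y depends strongly on x.
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(size=N)

# Assumed stratification: K intervals of x with decile boundaries.
edges = np.quantile(x, np.linspace(0.0, 1.0, K + 1)[1:-1])
labels = np.digitize(x, edges)  # integer stratum labels 0..K-1
groups = [np.flatnonzero(labels == k) for k in range(K)]

# Repeat both subsampling schemes and compare empirical variances.
reps = 300
uniform_estimates = np.array(
    [uniform_subsample_mean(y, n, rng) for _ in range(reps)]
)
stratified_estimates = np.array(
    [stratified_subsample_mean(y, groups, n, rng) for _ in range(reps)]
)

var_uniform = uniform_estimates.var(ddof=1)
var_stratified = stratified_estimates.var(ddof=1)
print(f"uniform variance:    {var_uniform:.5f}")
print(f"stratified variance: {var_stratified:.5f}")
```

Because `x` explains most of the variation in `y` in this simulated setup, the within-stratum variance is much smaller than the overall variance, so the stratified estimator should vary less across repetitions. Building the strata takes one pass over the data per interval, which is in the spirit of the linear additional cost described above.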