Subsampling is a popular way to balance statistical efficiency and computational efficiency in the big data era. Most approaches select informative or representative sample points so that the subsample retains as much of the information in the full data as possible. The present work takes a different view: given a well-designed partition of the data, sampling is needed only in the region of interest, while summary measures suffice to capture the information in the remaining regions. We propose a multi-resolution subsampling strategy that combines global information, described by summary measures, with local information obtained from selected subsample points. We show that the proposed method yields a more efficient subsample-based estimator for general large-scale classification problems. We establish asymptotic properties of the proposed method and explore its connections to existing subsampling procedures. Finally, we illustrate the proposed strategy with simulated and real-world examples.