In large-scale statistical modeling, reducing data size through subsampling is essential for balancing computational efficiency and statistical accuracy. We propose a new method, Principal Component Analysis guided Quantile Sampling (PCA-QS), which projects data onto principal components and applies quantile-based sampling to retain representative and diverse subsets. Compared with uniform random sampling, leverage score sampling, and coreset methods, PCA-QS consistently achieves lower mean squared error and better preservation of key data characteristics, while also being computationally efficient. This approach is adaptable to a variety of data scenarios and shows strong potential for broad applications in statistical computing.
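The abstract describes PCA-QS only at a high level: project the data onto principal components, then sample by quantiles of the projected scores. The sketch below is one illustrative reading of that idea, not the authors' implementation. It makes several assumptions that do not come from the source: only the first principal component is used (approximated by power iteration), the scores are cut into `n_bins` equal-count quantile bins, and the sampling budget `m` is spread evenly across bins. The function names `first_principal_axis` and `pca_quantile_sample` are hypothetical.

```python
import math
import random


def first_principal_axis(rows, rng, iters=200):
    """Approximate the leading principal axis via power iteration (assumed simplification)."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    # Random start vector, normalized to unit length.
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    v = [x / norm for x in v]
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:  # degenerate data with zero variance
            break
        v = [x / norm for x in w]
    return means, v


def pca_quantile_sample(rows, m, n_bins=10, seed=0):
    """Return sorted row indices of a subsample of size at most m.

    Assumed scheme: score each row on the first principal axis, split the
    rows into n_bins equal-count quantile bins of that score, and draw a
    near-equal share of the budget m uniformly at random from every bin.
    """
    rng = random.Random(seed)
    means, axis = first_principal_axis(rows, rng)
    d = len(axis)
    scores = [sum((r[j] - means[j]) * axis[j] for j in range(d)) for r in rows]
    order = sorted(range(len(rows)), key=scores.__getitem__)
    n = len(rows)
    chosen = []
    for b in range(n_bins):
        # Rows whose score falls in the b-th quantile bin.
        bin_idx = order[b * n // n_bins:(b + 1) * n // n_bins]
        # Per-bin share of the budget; the shares sum to m.
        take = (b + 1) * m // n_bins - b * m // n_bins
        chosen.extend(rng.sample(bin_idx, min(take, len(bin_idx))))
    return sorted(chosen)


# Small synthetic demonstration: 500 rows, 3 features with unequal variance.
demo_rng = random.Random(42)
data = [[demo_rng.gauss(0, 3), demo_rng.gauss(0, 1), demo_rng.gauss(0, 0.5)]
        for _ in range(500)]
subset = pca_quantile_sample(data, m=50, n_bins=10, seed=1)
```

Because every quantile bin contributes rows, the subsample covers both the tails and the center of the dominant direction of variation, which is one plausible way to obtain the "representative and diverse" subsets the abstract mentions. A fuller version would presumably use several components and a numerical library for the decomposition.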