With appropriately chosen sampling probabilities, sampling-based random projection can be used to implement large-scale statistical methods, substantially reducing computational cost while maintaining low statistical error. However, computing optimal sampling probabilities is often itself expensive, and in practice one typically resorts to suboptimal schemes. This generally leads to increased time and space costs, as more subsamples are required and the resulting projection matrices become larger, thereby making the inference procedure more computationally demanding. In this paper, we extend the framework of sampling-based random projection and propose a new projection method, \emph{accumulative sub-sampling}. By carefully accumulating multiple such projections, accumulative sub-sampling improves statistical efficiency while controlling the effective matrix size throughout the statistical computation. On the theoretical side, we quantify how the quality of the subsampling scheme affects the error in approximating matrix products and positive semidefinite matrices, and show how the proposed accumulation strategy mitigates this effect. Moreover, we apply our method to statistical models involving intensive matrix operations, such as eigendecomposition in spectral clustering and matrix inversion in kernel ridge regression, and demonstrate that reducing the effective matrix size leads to substantial computational savings. Numerical experiments across a range of problems further show that our approach consistently improves computational efficiency compared to existing random projection baselines under suboptimal sampling schemes.
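To make the role of the sampling probabilities concrete, here is a minimal sketch of a generic row-sampling projection applied to approximating a Gram matrix $A^\top A$. It is not the accumulative sub-sampling method proposed here, only the standard baseline the abstract builds on. The helper names (`sampling_sketch`, `mean_gram_error`), the synthetic data, and the choice of squared-row-norm probabilities as the "well-chosen" scheme are all assumptions made for illustration.

```python
import numpy as np


def sampling_sketch(A, probs, s, rng):
    """Return S @ A for a row-sampling projection S with s rows.

    Hypothetical helper: rows are drawn i.i.d. with probabilities `probs`
    and rescaled by 1 / sqrt(s * p_i), so that E[(SA)^T (SA)] = A^T A.
    """
    n = A.shape[0]
    idx = rng.choice(n, size=s, replace=True, p=probs)
    scale = 1.0 / np.sqrt(s * probs[idx])
    return A[idx] * scale[:, None]


def mean_gram_error(A, probs, s, rng, trials=20):
    """Average relative Frobenius error of (SA)^T (SA) against A^T A."""
    G = A.T @ A
    errs = []
    for _ in range(trials):
        SA = sampling_sketch(A, probs, s, rng)
        errs.append(np.linalg.norm(SA.T @ SA - G) / np.linalg.norm(G))
    return float(np.mean(errs))


rng = np.random.default_rng(0)
n, d, s = 2000, 5, 100

# Synthetic data (assumption): a few rows carry most of the mass,
# which is the regime where the sampling scheme matters most.
A = rng.standard_normal((n, d))
A[:20] *= 50.0

# Probabilities proportional to squared row norms versus a uniform scheme
# that ignores the data and is therefore suboptimal here.
row_norms_sq = np.sum(A**2, axis=1)
p_norm = row_norms_sq / row_norms_sq.sum()
p_unif = np.full(n, 1.0 / n)

err_norm = mean_gram_error(A, p_norm, s, rng)
err_unif = mean_gram_error(A, p_unif, s, rng)
print(f"norm-based sampling error: {err_norm:.3f}")
print(f"uniform sampling error:    {err_unif:.3f}")
```

At a fixed subsample size `s`, the uniform scheme gives a noticeably larger error on data like this, so matching the accuracy of the norm-based scheme would require a larger `s` and hence a larger sketched matrix. That is the cost the abstract attributes to suboptimal sampling.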