We consider the Sparse Principal Component Analysis (SPCA) problem under the well-known spiked covariance model. Recent work has shown that the SPCA problem can be reformulated as a Mixed Integer Program (MIP) and solved to global optimality, leading to estimators that are known to enjoy optimal statistical properties. However, prior MIP algorithms for SPCA appear to scale only to problems with about a thousand features. In this paper, we propose a new estimator for SPCA that can be formulated as a MIP. Unlike earlier work, we use the underlying spiked covariance model and properties of the multivariate Gaussian distribution to arrive at our estimator. We establish statistical guarantees for the proposed estimator in terms of estimation error and support recovery. We also derive guarantees under departures from the spiked covariance model and for approximate solutions to the optimization problem. We propose a custom algorithm to solve the MIP, which scales better than off-the-shelf solvers, and demonstrate that our approach can be much more computationally attractive than earlier exact MIP-based approaches for the SPCA problem. Our numerical experiments on synthetic and real datasets show that our algorithms can address problems with up to 20,000 features in minutes, and that they generally result in favorable statistical properties compared to existing popular approaches for SPCA.
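To make the setting concrete, the following is a minimal sketch of the spiked covariance model with a single sparse spike, assuming zero-mean Gaussian data with covariance $I_p + \theta u u^\top$ for an $s$-sparse unit vector $u$. It does not implement the MIP estimator proposed here; the recovery step is diagonal thresholding, a standard simple baseline swapped in purely for illustration. The function names `sample_spiked` and `diagonal_thresholding` and all parameter values are hypothetical.

```python
import numpy as np

def sample_spiked(n, p, s, theta, rng):
    """Draw n samples from N(0, I_p + theta * u u^T) with an s-sparse unit vector u."""
    u = np.zeros(p)
    support = rng.choice(p, size=s, replace=False)
    u[support] = rng.choice([-1.0, 1.0], size=s) / np.sqrt(s)
    # Each row is x = sqrt(theta) * g * u + z, with g ~ N(0, 1) and z ~ N(0, I_p),
    # so that Cov(x) = theta * u u^T + I_p.
    g = rng.standard_normal((n, 1))
    z = rng.standard_normal((n, p))
    return np.sqrt(theta) * g * u + z, u

def diagonal_thresholding(X, s):
    """Illustrative baseline (not the MIP estimator): keep the s coordinates
    with the largest sample variance, then take the leading eigenvector of
    the corresponding sample covariance submatrix."""
    S = X.T @ X / X.shape[0]                    # sample covariance (zero mean assumed)
    idx = np.sort(np.argsort(np.diag(S))[-s:])  # estimated support
    _, V = np.linalg.eigh(S[np.ix_(idx, idx)])  # eigenvalues in ascending order
    u_hat = np.zeros(X.shape[1])
    u_hat[idx] = V[:, -1]                       # leading eigenvector on the support
    return u_hat, idx

rng = np.random.default_rng(0)
X, u = sample_spiked(n=2000, p=50, s=5, theta=4.0, rng=rng)
u_hat, idx = diagonal_thresholding(X, s=5)
```

The two quantities analyzed in the guarantees map onto this sketch as follows: support recovery compares `idx` with the nonzero entries of `u`, and estimation error can be measured through `abs(u_hat @ u)`, which is close to 1 when the estimated direction matches the true one up to sign.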