Vintage factor analysis is an important type of factor analysis that first finds a low-dimensional representation of the original data and then seeks a rotation under which the rotated low-dimensional representation is scientifically meaningful. The most widely used form of vintage factor analysis is Principal Component Analysis (PCA) followed by the varimax rotation. Despite its popularity, few theoretical guarantees are available to date, mainly because the varimax rotation requires solving a non-convex optimization problem over the set of orthogonal matrices. In this paper, we propose a deflation varimax procedure that solves for each row of an orthogonal matrix sequentially. Beyond its net computational gain and flexibility, the proposed procedure admits theoretical guarantees that we fully establish in a broader context. Adopting this new deflation varimax as the second step after PCA, we further analyze the two-step procedure under a general class of factor models. Our results show that it estimates the factor loading matrix at the minimax optimal rate when the signal-to-noise ratio (SNR) is moderate or large. In the low-SNR regime, we offer a possible improvement over PCA followed by deflation varimax when the additive noise under the factor model is structured. The modified procedure is shown to be minimax optimal in all SNR regimes. Our theory is valid for finite samples and allows the number of latent factors to grow with the sample size, and the ambient dimension to grow with, or even exceed, the sample size. Extensive simulations and a real data analysis further corroborate our theoretical findings.
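For orientation, the classical two-step baseline described above (PCA followed by the varimax rotation) can be sketched as follows. This is a minimal illustration of the standard SVD-based varimax iteration, not the deflation varimax procedure proposed in the paper. The function names `pca_scores` and `varimax`, the simulated data, and all parameter values are assumptions made for this example only.

```python
import numpy as np


def pca_scores(X, r):
    """Return the top-r principal component scores (n x r) of X (n x p)."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :r] * s[:r]                       # scores = U_r * diag(s_r)


def varimax(A, max_iter=500, tol=1e-10):
    """Classical varimax: search for an orthogonal R (r x r) that makes the
    squared entries of A @ R as spread out as possible within each column.
    This is a non-convex problem, so only a stationary point is expected."""
    n, r = A.shape
    R = np.eye(r)
    obj_old = 0.0
    for _ in range(max_iter):
        L = A @ R
        # Gradient-like matrix of the varimax criterion at the current R.
        G = A.T @ (L ** 3 - L * (L ** 2).sum(axis=0) / n)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                                # nearest orthogonal matrix to G
        obj = s.sum()
        if obj < obj_old * (1 + tol):             # stop when the gain stalls
            break
        obj_old = obj
    return R


# Hypothetical simulated factor model: X = Z @ Lam + noise.
rng = np.random.default_rng(0)
n, p, r = 200, 10, 3
Z = rng.laplace(size=(n, r))                      # heavy-tailed latent factors (assumed)
Lam = rng.normal(size=(r, p))                     # loading matrix (assumed)
X = Z @ Lam + 0.1 * rng.normal(size=(n, p))

scores = pca_scores(X, r)                         # step 1: low-dimensional representation
R = varimax(scores)                               # step 2: orthogonal rotation
rotated = scores @ R
```

Because `R` is obtained from a joint optimization over all orthogonal matrices, the iteration above has no general guarantee of reaching a global optimum, which is the difficulty the abstract points to.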