Optimal vintage factor analysis with deflation varimax

Vintage factor analysis is an important type of factor analysis that first finds a low-dimensional representation of the original data and then seeks a rotation under which the rotated low-dimensional representation is scientifically meaningful. Perhaps the most widely used form of vintage factor analysis is Principal Component Analysis (PCA) followed by the varimax rotation. Despite its popularity, few theoretical guarantees are available, mainly because the varimax rotation requires solving a non-convex optimization problem over the set of orthogonal matrices. In this paper, we propose a deflation varimax procedure that solves for each row of an orthogonal matrix sequentially. In addition to offering a net computational gain and flexibility, the proposed procedure admits full theoretical guarantees in a broad context. Adopting this new varimax approach as the second step after PCA, we further analyze this two-step procedure under a general class of factor models. Our results show that it estimates the factor loading matrix at the optimal rate when the signal-to-noise ratio (SNR) is moderate or large. In the low-SNR regime, we offer a possible improvement over using PCA and the deflation procedure when the additive noise under the factor model is structured. The modified procedure is shown to be optimal in all SNR regimes. Our theory is valid for finite samples and allows the number of latent factors to grow with the sample size, as well as the ambient dimension to grow with, or even exceed, the sample size. Extensive simulations and real data analysis further corroborate our theoretical findings.
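To make the two-step idea concrete, the following is a minimal sketch of PCA followed by a row-by-row (deflation-style) search for an orthogonal rotation. It is an illustration under stated assumptions, not the procedure analyzed in the paper: the function names `pca_scores` and `deflation_varimax`, the fourth-power objective, the power-type fixed-point update, the random initialization, and the iteration count are all hypothetical choices made here, and the paper's actual update rule and initialization may differ.

```python
import numpy as np


def pca_scores(X, r):
    """Top-r principal component scores of X (n x p), scaled so that Z.T @ Z / n = I."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, _, _ = np.linalg.svd(Xc, full_matrices=False)
    return np.sqrt(X.shape[0]) * U[:, :r]         # n x r whitened scores


def deflation_varimax(Z, n_iter=200, seed=0):
    """Build an r x r orthogonal matrix Q one row at a time.

    Assumption for illustration: row k is found by a power-type fixed-point
    iteration that increases sum((Z @ q) ** 4) over unit vectors q that are
    orthogonal to the rows already found.
    """
    rng = np.random.default_rng(seed)
    r = Z.shape[1]
    Q = np.zeros((r, r))
    for k in range(r):
        # Projector onto the orthogonal complement of the rows found so far.
        P = np.eye(r) - Q[:k].T @ Q[:k]
        q = P @ rng.standard_normal(r)
        q /= np.linalg.norm(q)
        for _ in range(n_iter):
            g = Z.T @ (Z @ q) ** 3                # gradient of (1/4) * sum((Z @ q) ** 4)
            q = P @ g                             # stay orthogonal to earlier rows
            q /= np.linalg.norm(q)                # return to the unit sphere
        Q[k] = q
    return Q
```

With `Z = pca_scores(X, r)` and `Q = deflation_varimax(Z)`, the rotated low-dimensional representation is `Z @ Q.T`. Because each row is constrained to be orthogonal to the earlier ones, `Q` is orthogonal by construction, which is the sense in which the rows are solved sequentially rather than jointly.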
