We study a general factor analysis framework in which each entry of the $n$-by-$p$ data matrix is assumed to follow a general exponential family distribution. While this model framework has been proposed before, we further relax its distributional assumption by using a quasi-likelihood setup. By parameterizing the mean-variance relationship of the data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distributional misspecification but also more flexible, and it is able to capture non-Gaussian covariance structures of the data matrix. Our main focus is on efficient computational approaches to performing the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method has been shown to lead to asymptotic bias when the simulated sample size grows more slowly than the square root of the sample size $n$, which limits its practical use for data matrices with large $n$. Borrowing from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. The proposed procedures show no asymptotic bias and scale even better for large matrix factorizations, with error $O(1/p)$. To support our findings, we conduct simulation experiments and discuss the application of the procedures in three case studies.
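For orientation, a quasi-likelihood factor model of the kind described above can be sketched as follows. This is a generic formulation under stated assumptions, not necessarily the exact parameterization used in this work: the link function $g$, the variance function $V$, the dispersion $\phi$, the weights $w_{ij}$, the factor scores $\lambda_i \in \mathbb{R}^q$, and the loadings $v_j \in \mathbb{R}^q$ are all notation introduced here for illustration.

$$
\mathbb{E}[X_{ij}] = \mu_{ij} = g^{-1}(\eta_{ij}), \qquad
\eta_{ij} = \lambda_i^\top v_j, \qquad
\mathrm{Var}(X_{ij}) = \frac{\phi}{w_{ij}}\, V(\mu_{ij}),
$$

$$
U(\theta) = \sum_{i=1}^{n} \sum_{j=1}^{p} w_{ij}\, \frac{x_{ij} - \mu_{ij}}{\phi\, V(\mu_{ij})}\, \frac{\partial \mu_{ij}}{\partial \theta} .
$$

Under this sketch only the first two moments are specified, which is what makes the setup robust to misspecification of the full distribution. A missing entry can be handled by setting $w_{ij} = 0$, so that it drops out of the quasi-score $U(\theta)$, while $\phi$ absorbs over- or under-dispersion relative to the nominal variance function.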