We study a general factor analysis framework in which the entries of the $n$-by-$p$ data matrix are assumed to follow a general exponential family distribution. While this model framework has been proposed before, we further relax its distributional assumption by adopting a quasi-likelihood setup. By parameterizing the mean-variance relationship of the data entries, we additionally introduce a dispersion parameter and entry-wise weights to model large variations and missing values. The resulting model is thus not only robust to distribution misspecification but also more flexible, capable of capturing non-Gaussian covariance structures in the data matrix. Our main focus is on efficient computational approaches for performing the factor analysis. Previous modeling frameworks rely on simulated maximum likelihood (SML) to find the factorization solution, but this method has been shown to incur asymptotic bias when the simulated sample size grows more slowly than the square root of the sample size $n$, precluding its practical use for data matrices with large $n$. Borrowing ideas from expectation-maximization (EM) and stochastic gradient descent (SGD), we investigate three estimation procedures based on iterative factorization updates. Our proposed solutions exhibit no asymptotic bias and scale even better for large matrix factorizations, with error $O(1/p)$. To support our findings, we conduct simulation experiments and demonstrate the method's application in three case studies.