Confirmatory factor analysis (CFA) is a statistical method for identifying and confirming the presence of latent factors among observed variables by analyzing their covariance structure. Compared to alternative factor models, CFA offers interpretable common factors with enhanced specificity and a more adaptable approach to modeling covariance structures. However, the application of CFA has been limited by the requirement for prior knowledge about "non-zero loadings" and by a lack of computational scalability (e.g., it can be computationally intractable for hundreds of observed variables). We propose a data-driven semi-confirmatory factor analysis (SCFA) model that aims to alleviate these limitations. SCFA automatically specifies the "non-zero loadings" by learning the network structure of the large covariance matrix of the observed variables, and then provides closed-form, likelihood-based estimators for the factor loadings, factor scores, covariances between common factors, and error variances. SCFA is therefore applicable to high-throughput datasets (e.g., hundreds of thousands of observed variables) without requiring prior knowledge about "non-zero loadings". In an extensive simulation analysis benchmarked against standard packages, SCFA shows superior performance in estimating model parameters, with much reduced computational time. We illustrate its practical application through a factor analysis of a high-dimensional RNA-seq gene expression dataset.
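For orientation, the covariance structure that a factor model imposes can be written in the standard form Sigma = Lambda Phi Lambda^T + Psi, where Lambda holds the factor loadings, Phi the covariances between common factors, and Psi the diagonal error variances. The sketch below only illustrates how a pattern of zero and non-zero loadings shapes the implied covariance matrix; it is not the SCFA estimator, and the function names and all toy values are hypothetical choices made for this illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(row) for row in zip(*a)]


def implied_covariance(loadings, factor_cov, error_var):
    """Return Sigma = Lambda Phi Lambda^T + Psi, with Psi diagonal."""
    sigma = matmul(matmul(loadings, factor_cov), transpose(loadings))
    for i, v in enumerate(error_var):
        sigma[i][i] += v  # add the error variance on the diagonal
    return sigma


# Hypothetical toy model: 4 observed variables, 2 common factors.
# Each variable loads on exactly one factor; the zeros are the
# loadings that CFA fixes in advance.
loadings = [[0.8, 0.0],
            [0.7, 0.0],
            [0.0, 0.9],
            [0.0, 0.6]]
factor_cov = [[1.0, 0.3],
              [0.3, 1.0]]
error_var = [0.36, 0.51, 0.19, 0.64]

sigma = implied_covariance(loadings, factor_cov, error_var)
for row in sigma:
    print([round(x, 3) for x in row])
```

In this toy setup, variables that share a factor covary through the product of their loadings, while variables on different factors covary only through the factor covariance, which yields the block-like pattern in the printed matrix.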