Semi-Confirmatory Factor Analysis for High-Dimensional Data with Interconnected Community Structures

We propose a novel data-driven semi-confirmatory factor analysis (SCFA) model that addresses the absence of model specification and handles estimation and inference with high-dimensional data. Confirmatory factor analysis (CFA) is a widely used technique for statistically validating the covariance structure of latent common factors derived from multiple observed variables. In contrast to other factor analysis methods, CFA offers a flexible covariance modeling approach for common factors, which makes the relationships among the common factors, and between common factors and observations, easier to interpret. However, applying classic CFA models faces two barriers: the specification of "non-zero loadings" or factor membership (i.e., the assignment of observed variables to distinct common factors) is required in advance but is often unavailable, and the computational burden becomes formidable in high-dimensional scenarios where the number of observed variables exceeds the sample size. To overcome these two barriers, we propose the SCFA model, which integrates the underlying high-dimensional covariance structure of the observed variables into the CFA model. Additionally, we offer computationally efficient solutions (i.e., closed-form uniformly minimum variance unbiased estimators) and ensure accurate statistical inference through closed-form exact variance estimators for all model parameters and factor scores. In an extensive simulation analysis benchmarked against standard computational packages, SCFA exhibits superior performance in estimating model parameters and recovering factor scores while substantially reducing the computational load, in both low- and high-dimensional scenarios. SCFA also exhibits moderate robustness to model misspecification. We illustrate the practical application of the SCFA model by conducting factor analysis on a high-dimensional gene expression dataset.
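
To make the setting concrete, the following minimal sketch simulates data from a textbook CFA model, x = Λf + ε with implied covariance ΛΦΛᵀ + Ψ, in which each observed variable loads on exactly one common factor. It is not the SCFA estimator described above: the function names, unit loadings, community sizes, factor correlation, and the naive within-community averaging used for factor scores are all illustrative assumptions. The sketch only shows how factor membership induces a block ("community") pattern in the correlations of the observed variables.

```python
import numpy as np

def simulate_cfa(n=200, sizes=(10, 10, 10), factor_corr=0.3, noise_sd=1.0, seed=0):
    """Draw n samples from x = L f + e with one non-zero loading per variable.

    All settings (unit loadings, equal factor correlations) are assumptions
    chosen for illustration, not values taken from the paper.
    """
    rng = np.random.default_rng(seed)
    k = len(sizes)
    p = sum(sizes)
    # membership[j] = index of the common factor that variable j belongs to
    membership = np.repeat(np.arange(k), sizes)
    # Loading matrix: a single unit loading per row, zeros elsewhere
    L = np.zeros((p, k))
    L[np.arange(p), membership] = 1.0
    # Factor covariance: unit variances, common off-diagonal correlation
    Phi = np.full((k, k), factor_corr)
    np.fill_diagonal(Phi, 1.0)
    F = rng.multivariate_normal(np.zeros(k), Phi, size=n)  # latent factors
    E = rng.normal(scale=noise_sd, size=(n, p))            # independent noise
    X = F @ L.T + E
    return X, F, L, Phi, membership

def block_mean_corr(X, membership):
    """Average sample correlation within and between communities."""
    R = np.corrcoef(X, rowvar=False)
    same = membership[:, None] == membership[None, :]
    off_diag = ~np.eye(len(membership), dtype=bool)
    within = R[same & off_diag].mean()
    between = R[~same].mean()
    return within, between

X, F, L, Phi, membership = simulate_cfa()
within, between = block_mean_corr(X, membership)

# Model-implied covariance L Phi L^T + Psi, here with Psi = I (noise_sd = 1)
implied_cov = L @ Phi @ L.T + np.eye(L.shape[0])

# Naive factor scores (an assumption, not the paper's estimator):
# average the observed variables assigned to each factor
sizes = np.bincount(membership)
naive_scores = (X @ L) / sizes
score_corr = [np.corrcoef(naive_scores[:, j], F[:, j])[0, 1] for j in range(len(sizes))]

print(f"mean within-community correlation:  {within:.2f}")
print(f"mean between-community correlation: {between:.2f}")
```

In this toy setup, variables in the same community correlate more strongly with each other than with variables in other communities. The abstract describes exploiting this kind of covariance structure when the membership is not specified in advance.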
