In many scientific disciplines, the features of interest cannot be observed directly and must instead be inferred from observed behaviour. Latent variable analyses are increasingly employed to systematise these inferences, and Principal Components Analysis (PCA) is perhaps the simplest and most popular of these methods. Here, we examine how the assumptions that we are prepared to entertain about the latent variable system mediate the likelihood that PCA-derived components will capture the true sources of variance underlying data. As expected, we find that this likelihood is excellent in the best case and robust to empirically reasonable levels of measurement noise. However, best-case performance (a) is not robust to violations of the method's more prominent assumptions of linearity and orthogonality, and (b) requires that other, subtler assumptions hold, such as that the latent variables have varying importance and that the weights relating latent variables to observed data have zero mean. Neither variance explained nor replication in independent samples could reliably predict which (if any) PCA-derived components will capture true sources of variance in data. We conclude by describing a procedure to fit these inferences more directly to empirical data, and use it to find that components derived via PCA from two different empirical neuropsychological datasets are less likely to have meaningful referents in the brain than we hoped.
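The best-case scenario described in the abstract can be illustrated with a small simulation. The following is a minimal sketch, not the authors' procedure: it assumes independent Gaussian latent variables, orthonormal weight vectors, a linear mixing model and additive Gaussian noise. The function name `match_scores` and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np


def match_scores(importances, n_samples=5000, n_observed=20,
                 noise_sd=0.1, seed=0):
    """Simulate a linear latent variable system and score PCA recovery.

    `importances` holds the standard deviation of each latent variable
    (an assumed way of encoding "varying importance"). Returns, for each
    true weight vector, the largest absolute cosine similarity with any
    of the leading PCA components.
    """
    rng = np.random.default_rng(seed)
    importances = np.asarray(importances, dtype=float)
    n_latent = importances.size

    # Assumption: orthonormal weights, taken from the QR decomposition
    # of a random Gaussian matrix.
    q, _ = np.linalg.qr(rng.standard_normal((n_observed, n_latent)))
    weights = q.T  # shape (n_latent, n_observed), rows are unit vectors

    # Independent latent variables scaled by their importance.
    latents = rng.standard_normal((n_samples, n_latent)) * importances

    # Linear mixing plus measurement noise.
    noise = noise_sd * rng.standard_normal((n_samples, n_observed))
    observed = latents @ weights + noise

    # PCA via singular value decomposition of the centred data.
    centred = observed - observed.mean(axis=0)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    components = vt[:n_latent]  # rows are unit-length principal axes

    # Absolute cosine similarity, since the sign of a component is arbitrary.
    similarity = np.abs(components @ weights.T)
    return similarity.max(axis=0)


if __name__ == "__main__":
    print("varying importance:", match_scores([3.0, 2.0, 1.0]))
    print("equal importance:  ", match_scores([2.0, 2.0, 2.0]))
```

With clearly separated importances, each true weight vector should be matched closely by one component. With equal importances, the leading components are only determined up to a rotation within the latent subspace, so the match is generally poorer, which is one way to see why the assumption of varying importance matters.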