Principal Component Analysis (PCA) is one of the most commonly used statistical methods for data exploration and for dimensionality reduction, wherein the first few principal components account for an appreciable proportion of the variability in the data. The last principal components receive less attention, precisely because they do not account for an appreciable proportion of the variability. However, this defining characteristic also qualifies them as combinations of variables that are approximately constant across the cases. Such constant combinations are important because they may reflect underlying laws of nature. In situations involving a large number of noisy covariates, the underlying law may not correspond to the very last principal component, but rather to one of the last few. Consequently, a criterion is required to identify the relevant eigenvector. In this paper, two examples are employed to demonstrate the proposed methodology: one from physics, involving a small number of covariates, and another from meteorology, wherein the number of covariates is in the thousands. It is shown that, with an appropriate selection criterion, PCA can be employed to ``discover'' Kepler's third law (in the former) and the hypsometric equation (in the latter).
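The idea that a last principal component encodes a constant combination can be illustrated with a minimal sketch for the small-covariate case. It is not the paper's procedure: the abstract does not specify the data, the preprocessing, or the selection criterion, so the choices here are assumptions. The sketch uses approximate textbook values of the semi-major axis and orbital period of the eight planets, takes logarithms so that a power law becomes a linear combination, and simply reads off the eigenvector with the smallest eigenvalue. The names `a`, `T`, `last_pc`, and `exponent` are illustrative.

```python
import numpy as np

# Approximate semi-major axis (AU) and orbital period (years), Mercury to Neptune.
# Assumption: these textbook values stand in for whatever data the paper uses.
a = np.array([0.387, 0.723, 1.000, 1.524, 5.203, 9.537, 19.19, 30.07])
T = np.array([0.241, 0.615, 1.000, 1.881, 11.862, 29.457, 84.01, 164.8])

# Assumption: log-transform so that T^2 = k * a^3 becomes a linear relation,
# 3*log(a) - 2*log(T) = constant.
X = np.column_stack([np.log(a), np.log(T)])
Xc = X - X.mean(axis=0)

# Eigen-decomposition of the sample covariance matrix.
# np.linalg.eigh returns eigenvalues in ascending order, so column 0
# of eigvecs is the last principal component (smallest variance).
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
last_pc = eigvecs[:, 0]

# The last PC satisfies last_pc[0]*log(a) + last_pc[1]*log(T) ~ constant,
# i.e. log(T) ~ exponent * log(a) + constant.
exponent = -last_pc[0] / last_pc[1]

# Fraction of total variance carried by the last PC.
last_fraction = eigvals[0] / eigvals.sum()

print("last PC loadings:", last_pc)
print("implied exponent in T ~ a^p:", exponent)
print("variance fraction of last PC:", last_fraction)
```

With only two covariates, the last component is the only candidate, so no selection criterion is needed. In the high-dimensional meteorological setting described above, several trailing eigenvectors would have to be compared, and that is where the paper's criterion would come in.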