Principal component analysis (PCA) is commonly used in genetics to infer and visualize population structure and admixture between populations. PCA results are often interpreted much like inferred admixture proportions, under the assumption that each individual either belongs to one of several possible populations or is admixed between them. We propose a new method to assess the statistical fit of PCA (interpreted as a model spanned by the top principal components) and show that violations of the PCA assumptions affect the fit. Our method uses the chosen top principal components to predict the genotypes. By assessing the covariance (and the correlation) of the residuals (the differences between observed and predicted genotypes), we are able to detect violations of the model assumptions. Based on simulations and genome-wide human data, we show that our assessment of fit can be used to guide the interpretation of the data and to pinpoint individuals that are not well represented by the chosen principal components. Our method works equally well on other similar models, such as the admixture model, in which the mean of the data is represented by a linear matrix decomposition.
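The idea of predicting genotypes from the top principal components and inspecting the residuals can be illustrated with a minimal sketch. It is a naive illustration, not the estimator proposed here: the function name `residual_correlation`, the simulated two-population data, and the choice of a plain truncated SVD are all assumptions made for the example, and no correction is applied for the bias that arises when an individual's own genotypes contribute to its prediction.

```python
import numpy as np


def residual_correlation(G, k):
    """Naive residual check for a rank-k PCA fit (hypothetical helper).

    G : (n_individuals, n_sites) genotype matrix with entries 0, 1 or 2.
    k : number of top principal components used to predict genotypes.
    Returns the residual matrix and the individual-by-individual
    correlation matrix of the residuals.
    """
    G = np.asarray(G, dtype=float)
    site_mean = G.mean(axis=0)
    X = G - site_mean                          # center each site
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    # Rank-k reconstruction: genotypes predicted from the top k components.
    predicted = (U[:, :k] * s[:k]) @ Vt[:k] + site_mean
    residuals = G - predicted                  # observed minus predicted
    # Rows are individuals, so corrcoef correlates residuals across sites.
    corr = np.corrcoef(residuals)
    return residuals, corr


# Simulated data (an assumption for illustration): two unadmixed populations
# with independent allele frequencies.
rng = np.random.default_rng(0)
n_per_pop, n_sites = 20, 500
freqs = rng.uniform(0.1, 0.9, size=(2, n_sites))
pop = np.repeat([0, 1], n_per_pop)
G = rng.binomial(2, freqs[pop])

R, C = residual_correlation(G, k=1)
off_diagonal = C[~np.eye(C.shape[0], dtype=bool)]
print("mean off-diagonal residual correlation:", off_diagonal.mean())
```

Under a well-fitting model the off-diagonal residual correlations would be expected to scatter around a small value close to zero, whereas blocks of clearly positive correlation between individuals would suggest structure that the chosen components do not capture.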