Evaluation of population structure inferred by principal component analysis or the admixture model

Principal component analysis (PCA) is commonly used in genetics to infer and visualize population structure and admixture between populations. PCA is often interpreted in much the same way as inferred admixture proportions: individuals are assumed either to belong to one of several possible populations or to be admixed between them. We propose a new method to assess the statistical fit of PCA (interpreted as a model spanned by the top principal components) and show that violations of the PCA assumptions affect the fit. Our method uses the chosen top principal components to predict the genotypes. By assessing the covariance (and the correlation) of the residuals (the differences between observed and predicted genotypes), we are able to detect violations of the model assumptions. Based on simulations and genome-wide human data, we show that our assessment of fit can be used to guide the interpretation of the data and to pinpoint individuals that are not well represented by the chosen principal components. Our method works equally well on other similar models, such as the admixture model, where the mean of the data is represented by a linear matrix decomposition.
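The idea of predicting genotypes from the top principal components and then inspecting the residuals can be illustrated with a minimal NumPy sketch. It is not the authors' estimator: it uses a plain truncated SVD and the naive residual correlation, with no correction for the bias that arises because each individual also contributes to the fitted components. The function names `simulate_genotypes` and `residual_correlation`, the toy simulation, and all parameter values are assumptions made for illustration only.

```python
import numpy as np


def simulate_genotypes(n_pops=3, n_per_pop=10, n_sites=2000, seed=0):
    """Toy data (an assumption, not from the paper): discrete populations
    with independent allele frequencies and binomial genotypes in {0, 1, 2}."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.05, 0.95, size=(n_pops, n_sites))
    labels = np.repeat(np.arange(n_pops), n_per_pop)
    genotypes = rng.binomial(2, freqs[labels]).astype(float)
    return genotypes, labels


def residual_correlation(genotypes, k):
    """Correlation between individuals of the residuals left after
    predicting genotypes from the top k principal components.

    genotypes: array of shape (n_individuals, n_sites).
    Returns an (n_individuals, n_individuals) correlation matrix.
    """
    # Center each site so the SVD captures structure around the site means.
    centered = genotypes - genotypes.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    # Rank-k reconstruction: the genotypes predicted by the top k components.
    predicted = (u[:, :k] * s[:k]) @ vt[:k]
    residuals = centered - predicted
    # Correlate residual vectors across sites for every pair of individuals.
    res_c = residuals - residuals.mean(axis=1, keepdims=True)
    cov = res_c @ res_c.T / (residuals.shape[1] - 1)
    sd = np.sqrt(np.diag(cov))
    return cov / np.outer(sd, sd)


def mean_within_population(corr, labels):
    """Average off-diagonal residual correlation among same-population pairs."""
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    return corr[same].mean()
```

In this toy setting, three discrete populations need two components. Calling `residual_correlation(genotypes, 1)` leaves shared structure in the residuals, so individuals from the same population tend to have positively correlated residuals, while `k=2` removes most of that signal. Even under a well-fitting model the naive correlations are not centered exactly on zero, which is one reason a bias-corrected estimator is preferable in practice.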



PCA


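In statistics, principal component analysis (PCA) is a method for projecting data from a higher-dimensional space into a lower-dimensional space by maximizing the variance along each dimension. Given a set of points in two, three, or more dimensions, a "best-fitting" line can be defined as the line that minimizes the average squared distance from the points to the line. The next best-fitting line can be chosen in the same way from the directions perpendicular to the first. Repeating this process yields an orthogonal basis in which the individual dimensions of the data are uncorrelated. These basis vectors are called the principal components.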
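The construction described above can be sketched in a few lines of NumPy. This is a minimal illustration that uses the singular value decomposition of the centered data, a standard way to obtain the same orthogonal basis; the function name `pca` and its return values are assumptions chosen for the example.

```python
import numpy as np


def pca(data, k):
    """Return the top k principal components, the projected scores,
    and the variance explained by each component.

    data: array of shape (n_samples, n_features).
    """
    # Center the data so that the components describe variance around the mean.
    centered = data - data.mean(axis=0)
    # Rows of vt are orthonormal directions ordered by decreasing variance.
    _, s, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    # Coordinates of each sample in the basis of the top k components.
    scores = centered @ components.T
    explained_variance = s[:k] ** 2 / (data.shape[0] - 1)
    return components, scores, explained_variance
```

The rows of `components` are mutually orthogonal unit vectors, and the columns of `scores` are uncorrelated with one another, which matches the property of the basis stated in the text.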
