Uncertainty of high-dimensional genetic data prediction with polygenic risk scores

In many predictive tasks, a large number of true predictors each carry only a weak signal, which leads to substantial uncertainty in the predicted outcomes. The polygenic risk score (PRS) is an example of such a scenario: many genetic variants are used as predictors for complex traits, each contributing only a small amount of information. Although PRS has become a standard tool in genetic prediction, its uncertainty remains largely unexplored. In this paper, we aim to establish the asymptotic normality of PRS in high-dimensional predictions without sparsity constraints. We investigate the marginal and ridge-type estimators that are popular in PRS applications, developing central limit theorems for both individual-level predicted values (e.g., genetically predicted human height) and cohort-level prediction accuracy measures (e.g., overall predictive $R$-squared in the testing dataset). Our results demonstrate that ignoring the prediction-induced uncertainty can lead to substantial underestimation of the true variance of PRS-based estimators, which in turn may produce overconfident confidence intervals and hypothesis tests. These findings provide key insights omitted by existing first-order asymptotic studies of high-dimensional sparsity-free predictions, which often focus solely on the point limits of predictive risks. We develop novel and flexible second-order random matrix theory results to assess the asymptotic normality of functionals with a general covariance matrix, without assuming Gaussian distributions for the data. We evaluate our theoretical results through extensive numerical analyses using real data from the UK Biobank. Our analysis underscores the importance of incorporating uncertainty assessments at both the individual and cohort levels when applying and interpreting PRS.
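As a rough illustration of the prediction-induced uncertainty described above, the following Python sketch simulates how the PRS of one fixed test individual varies when the training data are redrawn. It is a toy setup under stated assumptions, not the paper's procedure: the function names, the i.i.d. Gaussian genotypes with identity covariance, the heritability `h2`, the penalty `lam`, and all sample sizes are hypothetical choices made only for illustration. The marginal estimator is taken as $X^\top y / n$ and the ridge-type estimator as $(X^\top X + \lambda I)^{-1} X^\top y$.

```python
import numpy as np

def marginal_weights(X, y):
    """Marginal estimator (assumed form): X^T y / n."""
    n = X.shape[0]
    return X.T @ y / n

def ridge_weights(X, y, lam):
    """Ridge-type estimator (assumed form): (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

def simulate_individual_prs(n=200, p=400, h2=0.5, lam=200.0, n_rep=200, seed=0):
    """Redraw the training set n_rep times and record the PRS of one test individual.

    All settings are hypothetical: dense weak effects, i.i.d. standard normal
    predictors, and Gaussian noise chosen so that the trait variance is about 1.
    """
    rng = np.random.default_rng(seed)
    beta = rng.normal(0.0, np.sqrt(h2 / p), size=p)  # many weak true effects
    x0 = rng.normal(size=p)                          # one fixed test individual
    preds = np.empty((n_rep, 2))
    for r in range(n_rep):
        X = rng.normal(size=(n, p))
        y = X @ beta + rng.normal(0.0, np.sqrt(1.0 - h2), size=n)
        preds[r, 0] = x0 @ marginal_weights(X, y)
        preds[r, 1] = x0 @ ridge_weights(X, y, lam)
    return x0 @ beta, preds

true_value, preds = simulate_individual_prs()
print("true genetic value of the test individual:", true_value)
print("SD of marginal PRS across training sets:", preds[:, 0].std(ddof=1))
print("SD of ridge PRS across training sets:", preds[:, 1].std(ddof=1))
```

The spread of `preds` across replicates comes entirely from the randomness of the training data. An interval for an individual's predicted value that treats the estimated weights as fixed would omit this component.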
