Reference panel-based estimators are widely used in genetic prediction of complex traits because they address data privacy concerns and reduce computational and communication costs. Rather than relying solely on the original training data, these estimators use an external reference panel to estimate the covariance matrix of the predictors. In this paper, we investigate the performance of reference panel-based $L_1$ and $L_2$ regularized estimators within a unified framework based on approximate message passing (AMP). We identify several key factors that influence the accuracy of reference panel-based estimators: the sample sizes of the training data and the reference panel, the signal-to-noise ratio, the underlying sparsity of the signal, and the covariance matrix among predictors. Our findings reveal that, even when the reference panel has the same sample size as the training data, reference panel-based estimators tend to be less accurate than traditional regularized estimators. Furthermore, this performance gap widens as the amount of training data increases, which highlights the importance of constructing large-scale reference panels. To support our theoretical analysis, we develop a novel non-separable matrix AMP framework capable of handling the complexities introduced by a general covariance matrix and the additional randomness associated with a reference panel. We validate our theoretical results through extensive simulation studies and real data analyses using the UK Biobank database.
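As an illustrative sketch of the construction described above (the notation is assumed here for exposition and is not taken from the paper): let $X \in \mathbb{R}^{n \times p}$ and $y \in \mathbb{R}^{n}$ denote the training data, and let $W \in \mathbb{R}^{n_w \times p}$ denote a reference panel drawn from the same population as $X$. A traditional ridge ($L_2$) estimator and one common form of its reference panel-based counterpart may then be written as

$$
\hat{\beta}_{X}(\lambda) = \left( \frac{X^\top X}{n} + \lambda I_p \right)^{-1} \frac{X^\top y}{n},
\qquad
\hat{\beta}_{W}(\lambda) = \left( \frac{W^\top W}{n_w} + \lambda I_p \right)^{-1} \frac{X^\top y}{n}.
$$

Under these assumptions the two estimators differ only in the covariance estimate: $X^\top X / n$ is replaced by $W^\top W / n_w$, while the summary statistic $X^\top y / n$ is still computed from the training data. An analogous $L_1$ version may take the form

$$
\hat{\beta}_{W}^{L_1}(\lambda) \in \arg\min_{\beta \in \mathbb{R}^{p}} \; \frac{1}{2}\, \beta^\top \frac{W^\top W}{n_w}\, \beta \;-\; \beta^\top \frac{X^\top y}{n} \;+\; \lambda \lVert \beta \rVert_1 .
$$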