On self-training of summary data with genetic applications

Prediction model training is often hindered by limited access to individual-level data due to privacy concerns and logistical challenges, particularly in biomedical research. Resampling-based self-training presents a promising approach for building prediction models using only summary-level data. These methods leverage summary statistics to sample pseudo datasets for model training and parameter optimization, allowing for model development without individual-level data. Although increasingly used in precision medicine, the general behaviors of self-training remain unexplored. In this paper, we leverage a random matrix theory framework to establish the statistical properties of self-training algorithms for high-dimensional sparsity-free summary data. We demonstrate that, within a class of linear estimators, resampling-based self-training achieves the same asymptotic predictive accuracy as conventional training methods that require individual-level datasets. These results suggest that self-training with only summary data incurs no additional cost in prediction accuracy, while offering significant practical convenience. Our analysis provides several valuable insights and counterintuitive findings. For example, while pseudo-training and validation datasets are inherently dependent, their interdependence unexpectedly cancels out when calculating prediction accuracy measures, preventing overfitting in self-training algorithms. Furthermore, we extend our analysis to show that the self-training framework maintains this no-cost advantage when combining multiple methods or when jointly training on data from different distributions. We numerically validate our findings through simulations and real data analyses using the UK Biobank. Our study highlights the potential of resampling-based self-training to advance genetic risk prediction and other fields that make summary data publicly available.
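The resampling idea described in the abstract can be sketched in a few lines. The sketch below is not the paper's algorithm. It assumes one common construction: given the full-sample statistic s = X^T y, the sample size n, and a reference LD (correlation) matrix, it draws a pseudo-training statistic from the approximate conditional normal distribution of a random subsample and takes the remainder as the pseudo-validation statistic. The ridge-type linear estimator, the correlation-style accuracy measure, and all function and parameter names (`pseudo_split`, `ridge_from_summary`, `pseudo_accuracy`, `select_lambda`, `train_frac`, `y_var`) are illustrative assumptions.

```python
import numpy as np


def pseudo_split(s, n, n_train, ld, rng, y_var=1.0):
    """Sample pseudo training/validation summary statistics from s = X^T y.

    Assumes standardized predictors with correlation matrix `ld` and small
    per-variable effects, so that approximately
    s_train | s ~ N((n_train / n) * s, (n_train * n_val / n) * y_var * ld).
    """
    n_val = n - n_train
    p = s.shape[0]
    # A small jitter keeps the Cholesky factorization numerically stable.
    chol = np.linalg.cholesky(ld + 1e-8 * np.eye(p))
    noise = chol @ rng.standard_normal(p)
    s_train = (n_train / n) * s + np.sqrt(n_train * n_val / n * y_var) * noise
    s_val = s - s_train  # the two pieces always add back up to s
    return s_train, s_val


def ridge_from_summary(s_train, n_train, ld, lam):
    """Ridge-type linear estimator computed from summary statistics only."""
    p = ld.shape[0]
    return np.linalg.solve(ld + lam * np.eye(p), s_train / n_train)


def pseudo_accuracy(beta, s_val, n_val, ld, y_var=1.0):
    """Approximate correlation between predictions and the validation outcome."""
    denom = np.sqrt(beta @ ld @ beta * y_var)
    return float(beta @ s_val / n_val / denom)


def select_lambda(s, n, ld, lambdas, train_frac=0.8, seed=0):
    """Pick the ridge penalty with the best pseudo-validation accuracy."""
    rng = np.random.default_rng(seed)
    n_train = int(train_frac * n)
    n_val = n - n_train
    s_train, s_val = pseudo_split(s, n, n_train, ld, rng)
    scores = [
        pseudo_accuracy(ridge_from_summary(s_train, n_train, ld, lam), s_val, n_val, ld)
        for lam in lambdas
    ]
    return lambdas[int(np.argmax(scores))], scores
```

In this sketch the pseudo-training and pseudo-validation statistics are dependent by construction, since they sum to s. That is the kind of interdependence the abstract says cancels out when prediction accuracy is calculated.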

