Epigenetic aging clocks estimate an individual's biological age from DNA methylation patterns at numerous CpG (cytosine-phosphate-guanine) sites across the genome. However, valid inference on predicted epigenetic ages, or more broadly on predictions derived from high-dimensional inputs, remains challenging. We introduce a novel U-learning approach, based on combinatory multi-subsampling, for making ensemble predictions and constructing confidence intervals for predictions of continuous outcomes when traditional asymptotic methods are not applicable. Specifically, our approach casts the ensemble estimators as generalized U-statistics and invokes the Hájek projection to derive the variances of the predictions and to construct confidence intervals with valid conditional coverage probabilities. We apply the approach to two commonly used predictive algorithms, Lasso and deep neural networks (DNNs), and illustrate the validity of the resulting inferences in extensive numerical studies. We further apply these methods to predict the DNA methylation age (DNAmAge) of patients with various health conditions, with the aim of accurately characterizing the aging process and potentially guiding anti-aging interventions.
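To make the idea of a subsampling ensemble with a projection-based variance concrete, the following is a minimal sketch and not the estimator proposed in the paper. It assumes a closed-form ridge regression as a dependency-free stand-in for Lasso or a DNN, draws B random subsamples of size r without replacement (an incomplete U-statistic), and uses the infinitesimal-jackknife variance formula for subsampled ensembles as one known Hájek-projection-style estimate. The function names, the choice of r and B, and the simulated data are all illustrative assumptions.

```python
import numpy as np

def ridge_fit_predict(X, y, x_new, lam=1.0):
    """Fit ridge regression on (X, y) and predict at x_new.
    Assumption: ridge is only a simple stand-in for Lasso or a DNN."""
    p = X.shape[1]
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    return y_mean + (x_new - x_mean) @ beta

def subsample_ensemble_ci(X, y, x_new, r, B=2000, z=1.96, seed=0):
    """Average predictions over B subsamples of size r and attach a
    normal-approximation interval (hypothetical helper, not the paper's API)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    preds = np.empty(B)
    counts = np.zeros((B, n))  # counts[b, i] = 1 if observation i is in subsample b
    for b in range(B):
        idx = rng.choice(n, size=r, replace=False)
        counts[b, idx] = 1.0
        preds[b] = ridge_fit_predict(X[idx], y[idx], x_new)
    estimate = preds.mean()
    # Covariance, across subsamples, between each inclusion indicator
    # and the subsample prediction.
    cov = (counts - counts.mean(axis=0)).T @ (preds - estimate) / B
    # Infinitesimal-jackknife variance with the finite-sample correction
    # for sampling without replacement.
    var_hat = (n - 1) / n * (n / (n - r)) ** 2 * np.sum(cov ** 2)
    half = z * np.sqrt(var_hat)
    return estimate, (estimate - half, estimate + half)

# Simulated high-dimensional-style data (illustrative only).
rng = np.random.default_rng(1)
n, p = 200, 50
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:5] = 1.0
y = X @ beta + rng.normal(size=n)
x_new = rng.normal(size=p)

est, (lo, hi) = subsample_ensemble_ci(X, y, x_new, r=100, B=500)
print(f"ensemble prediction: {est:.3f}, interval: ({lo:.3f}, {hi:.3f})")
```

The interval here targets the expected ensemble prediction at `x_new`, not a future outcome, and with finite B the variance estimate carries additional Monte Carlo noise. Whether such an interval attains the coverage guarantees stated in the abstract depends on conditions this sketch does not check.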