Recent evidence highlights the usefulness of DNA methylation (DNAm) biomarkers as surrogates for exposure to risk factors for non-communicable diseases in epidemiological studies and randomized trials. DNAm variability has been shown to be tightly related to lifestyle behavior and exposure to environmental risk factors, ultimately providing an unbiased proxy of an individual's state of health. At present, DNAm surrogates are built with univariate penalized regression models, with the elastic-net penalty as the de facto standard. Nonetheless, more advanced modeling procedures are required when the outcomes are multivariate and the study samples exhibit a structured dependence pattern. In this work we propose a general framework for mixed-effects multitask learning in the presence of high-dimensional predictors, and use it to develop a multivariate DNAm biomarker from a multi-center study. We devise a penalized estimation scheme based on an expectation-maximization algorithm, in which any penalty criterion for fixed-effects models can be conveniently incorporated into the fitting process. We apply the proposed methodology to create novel DNAm surrogate biomarkers for multiple correlated risk factors for cardiovascular diseases and comorbidities. We show that the proposed approach, which models multiple outcomes jointly, outperforms state-of-the-art alternatives in both predictive power and bio-molecular interpretation of the results.