In the era of precision medicine, genome-wide epigenetic modifications offer rich data that could inform risk prediction. However, these data are high-dimensional and exhibit complex dependence structures, which makes them difficult to model jointly with low-dimensional covariates when the goal is to obtain interpretable effect estimates for covariate adjustment. Standard Bayesian additive regression trees (BART) provide strong predictive performance but treat all predictors uniformly within the tree ensemble, obscuring the contributions of important covariates and complicating variable selection in high-dimensional settings. We propose a semi-parametric BART model (spBART) that addresses this limitation by modeling low-dimensional covariates through a parametric component with interpretable coefficients, while capturing complex nonlinear associations among high-dimensional predictors through the tree ensemble. To perform stable variable selection, we develop a cross-validation-based procedure that aggregates posterior inclusion probabilities across folds and applies Bayesian false discovery rate control. We apply the proposed method to a pooled case--control analysis of genome-wide 5-hydroxymethylcytosine profiles derived from circulating cell-free DNA in two multiple myeloma studies ($N = 869$). The approach identifies a parsimonious set of candidate loci and achieves strong out-of-sample discrimination (AUC $= 0.96$) in a held-out validation set. Overall, spBART provides a unified framework for combining interpretable covariate inference with flexible modeling and variable selection in high-dimensional biomedical studies.
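The variable-selection step described above (aggregating posterior inclusion probabilities across folds, then applying Bayesian false discovery rate control) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `bayesian_fdr_select`, the use of a simple mean to aggregate across folds, the threshold `alpha`, and the toy numbers are all assumptions made for illustration. The selection rule shown is the common one in which the expected false discovery rate of a selected set is the average of $1 - \mathrm{PIP}$ over its members.

```python
from statistics import fmean


def bayesian_fdr_select(pip_by_fold, alpha=0.05):
    """Select predictors from per-fold posterior inclusion probabilities (PIPs).

    pip_by_fold: list of folds, each a list of PIPs (one per predictor).
    alpha: target Bayesian FDR level (hypothetical default).
    Returns (sorted indices of selected predictors, fold-averaged PIPs).
    """
    n_predictors = len(pip_by_fold[0])

    # Assumption: aggregate across folds with a plain average.
    pip = [fmean(fold[j] for fold in pip_by_fold) for j in range(n_predictors)]

    # Rank predictors from highest to lowest averaged PIP.
    order = sorted(range(n_predictors), key=lambda j: -pip[j])

    # Grow the selected set while the running mean of (1 - PIP),
    # i.e. the estimated Bayesian FDR of the set, stays at or below alpha.
    selected, cumulative = [], 0.0
    for rank, j in enumerate(order, start=1):
        cumulative += 1.0 - pip[j]
        if cumulative / rank > alpha:
            break
        selected.append(j)

    return sorted(selected), pip


# Toy example: two folds, four predictors (made-up values).
toy_pips = [
    [0.99, 0.10, 0.97, 0.50],
    [0.97, 0.20, 0.95, 0.40],
]
selected, averaged = bayesian_fdr_select(toy_pips, alpha=0.05)
print(selected)  # [0, 2]
```

Because the predictors are visited in order of decreasing PIP, the running mean of $1 - \mathrm{PIP}$ never decreases, so stopping at the first violation yields the largest set that meets the target level.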