In the era of precision medicine, genome-wide epigenetic modifications offer rich data that could inform risk prediction. However, these data are high-dimensional and exhibit complex dependence structures, which makes it difficult to jointly model them with low-dimensional covariates when the goal is to obtain interpretable effect estimates for covariate adjustment. Standard Bayesian additive regression trees (BART) provide strong predictive performance but treat all predictors uniformly within the tree ensemble, obscuring the contributions of significant covariates and complicating variable selection in high-dimensional settings. We propose a semi-parametric BART model (spBART) that addresses this limitation by modeling low-dimensional covariates through a parametric component with interpretable coefficients, while capturing complex nonlinear associations among high-dimensional predictors through the tree ensemble. To perform stable variable selection, we develop a cross-validation-based procedure that aggregates posterior inclusion probabilities across folds and applies Bayesian false discovery rate control. We apply the proposed method to a pooled case--control analysis of high-dimensional genome-wide 5-hydroxymethylcytosine profiles derived from circulating cell-free DNA in two multiple myeloma studies ($N = 869$). The approach identifies a parsimonious set of candidate loci and achieves strong out-of-sample discrimination (AUC $= 0.96$) in a held-out validation set. Overall, spBART provides a unified framework for combining interpretable covariate inference with flexible modeling and variable selection in high-dimensional biomedical studies.
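The fold-aggregated selection step can be illustrated with a minimal sketch. The abstract does not state how posterior inclusion probabilities are combined across folds or which thresholding rule is used, so this sketch assumes a simple average across folds and the common Bayesian FDR rule: select the largest set of predictors whose mean posterior probability of being a false inclusion, $1 - \mathrm{PIP}$, stays at or below a target level. The function name `bayesian_fdr_select` and the parameter `alpha` are hypothetical and are not taken from the paper.

```python
import numpy as np


def bayesian_fdr_select(fold_pips, alpha=0.05):
    """Select predictors from fold-wise posterior inclusion probabilities (PIPs).

    fold_pips : array-like of shape (n_folds, n_features)
    alpha     : target Bayesian false discovery rate (hypothetical default)

    Returns the sorted indices of selected features and the averaged PIPs.
    """
    pips = np.asarray(fold_pips, dtype=float)

    # Assumption: aggregate across folds with a simple mean.
    mean_pip = pips.mean(axis=0)

    # Rank features from highest to lowest aggregated PIP.
    order = np.argsort(-mean_pip, kind="stable")

    # 1 - PIP is the posterior probability that an inclusion is false.
    false_prob = 1.0 - mean_pip[order]

    # Expected FDR of the top-k set is the running mean of those probabilities.
    running_fdr = np.cumsum(false_prob) / np.arange(1, false_prob.size + 1)

    # Keep the largest top-k set whose expected FDR does not exceed alpha.
    passing = np.nonzero(running_fdr <= alpha)[0]
    n_selected = passing[-1] + 1 if passing.size else 0

    return np.sort(order[:n_selected]), mean_pip
```

Calling `bayesian_fdr_select(pips, alpha=0.05)` on a folds-by-features array returns the indices of the retained predictors. Because the running mean of sorted false-inclusion probabilities never decreases, the selected set is always a prefix of the PIP ranking.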