Variable selection plays a critical role in modern statistical learning and scientific discovery. Numerous regularization and Bayesian variable selection methods have been developed over the past two decades, but most of them select variables for only one response. As more data are collected, it is increasingly common to analyze multiple related responses from the same study. Existing multivariate variable selection methods select variables for all responses without considering possible heterogeneity across responses, i.e., some features may predict only a subset of the responses and not the rest. Motivated by the multi-trait fine-mapping problem in genetics, which aims to identify the causal variants for multiple related traits, we developed a novel multivariate Bayesian variable selection method that selects critical predictors from a large number of grouped predictors targeting multiple correlated and possibly heterogeneous responses. The new method features selection at multiple levels, incorporation of prior biological knowledge to guide selection, and identification of the best subset of responses that each predictor targets. We demonstrated the advantages of our method through extensive simulations and a real fine-mapping example that identifies causal variants associated with different subsets of addictive behaviors.