The high-dimensional nature of genomics data complicates feature selection, particularly in studies with small sample sizes, which are not uncommon in clinical prediction settings. It is widely recognized that complementary data on the features, `co-data', may improve results. Examples are prior feature groups or p-values from a related study. Such co-data are ubiquitous in genomics settings due to the availability of public repositories. Yet the uptake of learning methods that use such co-data in a structured way is limited. We review guided adaptive shrinkage methods: a class of regression-based learners that use co-data to adapt the shrinkage parameters, which are crucial for the performance of those learners. We discuss technical aspects as well as applicability in terms of the types of co-data that can be handled. This class of methods is contrasted with several others. In particular, group-adaptive shrinkage is compared with the better-known sparse group-lasso by evaluating feature selection. Finally, we demonstrate the versatility of the guided shrinkage methodology by showing how to `do-it-yourself': we integrate implementations of a co-data learner and the spike-and-slab prior for the purpose of improving feature selection in genetics studies.