We consider the problem of sparse variable selection on high-dimensional heterogeneous data sets, which has attracted renewed interest due to the growth of biological and medical data sets with complex, non-i.i.d. structures and large numbers of response variables. This heterogeneity is likely to confound the association between explanatory variables and responses, resulting in many false discoveries when the Lasso or its variants are naively applied. Developing effective confounder correction methods has therefore become an active research topic. However, directly applying recent confounder correction methods yields unsatisfactory performance because they ignore the intricate interdependencies among response variables. To improve on current variable selection methods, we introduce the tree-guided sparse linear mixed model, which uses the dependency information among multiple responses to explore how the responses cluster and to select the active variables from heterogeneous data. Through extensive experiments on synthetic and real data sets, we show that the proposed model outperforms existing methods and achieves the highest area under the ROC curve.