A common problem in machine learning is determining whether a variable contributes significantly to a model's predictive performance. The problem is aggravated in datasets, such as gene expression datasets, that suffer the worst form of the curse of dimensionality: few observations and many candidate explanatory variables. In such settings, traditional methods for testing the statistical significance of variables or constructing confidence intervals for them do not apply. To address these problems, we developed a novel permutation framework for testing the significance of variables in supervised models. The framework has three main advantages. First, it is non-parametric and does not rely on distributional assumptions or asymptotic results. Second, it not only ranks model variables by relative importance but also tests the statistical significance of each variable. Third, it can test the significance of interactions between model variables. We applied the framework to multi-class classification of the Iris flower dataset and of brain regions in RNA expression data, and used it to show variable-level statistical significance and interactions.
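To make the general idea of permutation-based variable testing concrete, the following is a minimal sketch and not the framework described above. It assumes a simple nearest-centroid classifier, a single train/test split, synthetic data with one informative and one noise variable, and a p-value computed as the fraction of permutations whose held-out accuracy is at least the observed accuracy. All function names (`fit_centroids`, `accuracy`, `permutation_p_value`) and parameters are hypothetical, chosen for illustration only, and interaction testing is not covered.

```python
import numpy as np


def fit_centroids(X, y):
    """Return the class labels and the per-class mean vectors."""
    classes = np.unique(y)
    centroids = np.stack([X[y == c].mean(axis=0) for c in classes])
    return classes, centroids


def accuracy(X, y, classes, centroids):
    """Accuracy of nearest-centroid predictions on (X, y)."""
    # Squared Euclidean distance from every sample to every centroid.
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float((classes[dists.argmin(axis=1)] == y).mean())


def permutation_p_value(X_train, y_train, X_test, y_test, feature,
                        n_perm=200, seed=0):
    """Permutation test for one variable (illustrative assumption).

    The chosen column of the test set is shuffled n_perm times, which
    breaks its link to the labels. The p-value is the share of shuffles
    whose accuracy is at least the unshuffled accuracy, with the usual
    +1 correction so that it is never exactly zero.
    """
    rng = np.random.default_rng(seed)
    classes, centroids = fit_centroids(X_train, y_train)
    observed = accuracy(X_test, y_test, classes, centroids)
    hits = 0
    for _ in range(n_perm):
        X_perm = X_test.copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        if accuracy(X_perm, y_test, classes, centroids) >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic three-class data: column 0 carries signal, column 1 is noise.
rng = np.random.default_rng(42)
y = np.repeat(np.arange(3), 40)
X = np.column_stack([y * 3.0 + rng.normal(size=y.size),
                     rng.normal(size=y.size)])
idx = rng.permutation(y.size)
train, test = idx[:80], idx[80:]

acc, p_signal = permutation_p_value(X[train], y[train], X[test], y[test], feature=0)
_, p_noise = permutation_p_value(X[train], y[train], X[test], y[test], feature=1)
print(f"accuracy={acc:.2f}  p(signal)={p_signal:.3f}  p(noise)={p_noise:.3f}")
```

In this sketch, shuffling the informative column should collapse accuracy toward chance and yield a small p-value, whereas shuffling the noise column should leave accuracy largely unchanged. Because no distribution is assumed for the test statistic, the approach remains usable when observations are few, which is the setting the abstract emphasizes.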