Stability selection is a widely used method for improving the performance of feature selection algorithms. However, stability selection has been found to be highly conservative, resulting in low sensitivity. Further, the theoretical bound on the expected number of false positives, E(FP), is relatively loose, making it difficult to know how many false positives to expect in practice. In this paper, we introduce a novel method for stability selection based on integrating the stability paths rather than maximizing over them. This yields a tighter bound on E(FP), resulting in a feature selection criterion that has higher sensitivity in practice and is better calibrated in terms of matching the target E(FP). Our proposed method requires the same amount of computation as the original stability selection algorithm, and only requires the user to specify one input parameter, a target value for E(FP). We provide theoretical bounds on performance, and demonstrate the method on simulations and real data from cancer gene expression studies.
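To make the contrast between maximizing over a stability path and integrating it concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the paper's procedure: the abstract does not specify the integration weights or how the selection threshold is derived from the target E(FP), so the sketch uses a trapezoidal integral normalized by the width of the regularization grid and stops short of thresholding. The function names, the grid `lambdas`, and the selection frequencies in `paths` are all hypothetical.

```python
from typing import Dict, Sequence, Tuple


def max_over_path(path: Sequence[float]) -> float:
    """Summary used when maximizing: the largest selection frequency on the path."""
    return max(path)


def integral_over_path(lambdas: Sequence[float], path: Sequence[float]) -> float:
    """Trapezoidal integral of the path, divided by the grid width.

    Assumption: normalizing by the grid width is a choice made here so the
    result stays in [0, 1]; the abstract does not state the paper's weighting.
    """
    area = 0.0
    for i in range(1, len(lambdas)):
        width = lambdas[i] - lambdas[i - 1]
        area += 0.5 * (path[i] + path[i - 1]) * width
    return area / (lambdas[-1] - lambdas[0])


# Hypothetical selection frequencies (fraction of subsamples in which a
# feature was selected) on a five-point regularization grid.
lambdas = [0.0, 0.25, 0.5, 0.75, 1.0]
paths: Dict[str, Sequence[float]] = {
    "steady": [0.8, 0.8, 0.8, 0.8, 0.8],  # selected often at every grid point
    "spiky": [0.0, 0.0, 1.0, 0.0, 0.0],   # selected only at one grid point
}

# For each feature: (max summary, integral summary).
summary: Dict[str, Tuple[float, float]] = {
    name: (max_over_path(p), integral_over_path(lambdas, p))
    for name, p in paths.items()
}

for name, (peak, area) in summary.items():
    print(f"{name}: max={peak:.2f}, integral={area:.2f}")
```

In this toy input the two summaries rank the features differently: the "spiky" feature scores higher under the maximum, while the "steady" feature scores higher under the integral. A selection rule would then compare the chosen summary against a threshold calibrated to the target E(FP); that calibration comes from the paper's bound and is not reproduced here.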