Controlled variable selection is an important analytical step in scientific fields such as brain imaging and genomics. In these high-dimensional settings, including too many variables leads to poor models and high costs, hence the need for statistical guarantees on false positives. Knockoffs are a popular statistical tool for conditional variable selection in high dimensions. However, they control the expected proportion of false discoveries (the false discovery rate, FDR) rather than the proportion actually realized in a given analysis (the false discovery proportion, FDP). We present a new method, KOPI, that controls the false discovery proportion for Knockoff-based inference. The proposed method also relies on a new type of aggregation to address the undesirable randomness associated with classical Knockoff inference. We demonstrate FDP control and substantial power gains over existing Knockoff-based methods in various simulation settings, and achieve good sensitivity/specificity tradeoffs on brain imaging and genomic data.
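The gap between controlling the expected proportion of false discoveries and controlling the realized proportion can be illustrated with a small simulation. The sketch below is not KOPI and does not use Knockoffs: as a stand-in, it applies the Benjamini-Hochberg procedure to synthetic independent p-values. The function names (`bh_reject`, `simulate_fdp`), the Beta(0.1, 1) distribution for non-null p-values, and all parameter values are assumptions chosen for illustration only.

```python
import numpy as np


def bh_reject(pvals, q):
    """Benjamini-Hochberg step-up procedure: boolean mask of rejections at level q."""
    m = len(pvals)
    order = np.argsort(pvals)
    sorted_p = pvals[order]
    # Find the largest rank k with p_(k) <= q * k / m and reject the k smallest p-values.
    below = sorted_p <= q * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject


def simulate_fdp(n_runs=500, m=200, m1=20, q=0.1, seed=0):
    """Return the realized FDP of each of n_runs independent synthetic experiments.

    Assumption (illustrative only): null p-values are Uniform(0, 1) and
    non-null p-values follow Beta(0.1, 1), which concentrates near zero.
    """
    rng = np.random.default_rng(seed)
    is_null = np.ones(m, dtype=bool)
    is_null[:m1] = False  # the first m1 hypotheses are true signals
    fdps = np.empty(n_runs)
    for r in range(n_runs):
        pvals = rng.uniform(size=m)
        pvals[:m1] = rng.beta(0.1, 1.0, size=m1)
        reject = bh_reject(pvals, q)
        n_false = np.sum(reject & is_null)
        # FDP is defined as 0 when nothing is rejected.
        fdps[r] = n_false / max(reject.sum(), 1)
    return fdps


if __name__ == "__main__":
    q = 0.1
    fdps = simulate_fdp(q=q)
    print("empirical FDR (mean FDP over runs):", fdps.mean())
    print("fraction of runs with FDP above q: ", (fdps > q).mean())
```

In this toy setting the average FDP over runs stays near or below the target level, while individual runs can exceed it. That per-run variability is what an FDR guarantee leaves uncontrolled and what an FDP guarantee is meant to address.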