In modern scientific research, the objective is often to identify which variables, among a large set of potential predictors, are associated with an outcome. This goal can be achieved by selecting variables in a manner that controls the false discovery rate (FDR), the expected proportion of irrelevant predictors among the selections. Knockoff filtering is a cutting-edge approach to variable selection that provides FDR control. Existing knockoff statistics frequently employ linear models to assess relationships between features and the response, but the linearity assumption is often violated in real-world applications, which may result in poor power to detect truly prognostic variables. We introduce a knockoff statistic based on the conditional prediction function (CPF), which can be paired with state-of-the-art machine learning predictive models, such as deep neural networks. The CPF statistics can capture nonlinear relationships between predictors and outcomes while also accounting for correlation between features. Using repeated simulations with continuous, categorical, and survival outcomes, we illustrate that the CPF statistics can provide superior power over common knockoff statistics. Knockoff filtering with the CPF statistics is demonstrated using (1) a residential building dataset to select predictors of actual sales prices and (2) the TCGA dataset to select genes that are correlated with disease staging in lung cancer patients.
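To make the FDR-controlling selection step concrete, the following is a minimal sketch of the knockoff+ thresholding rule commonly used in knockoff filtering. It is not the CPF statistic itself, whose construction the abstract does not describe. It assumes each feature j already has a statistic W_j, where a large positive value suggests the original feature is more informative than its knockoff copy. The function names and the example values are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence


def knockoff_plus_threshold(w: Sequence[float], q: float) -> float:
    """Smallest t > 0 with (1 + #{W_j <= -t}) / max(1, #{W_j >= t}) <= q."""
    # Only the nonzero magnitudes of the statistics need to be tried.
    candidates = sorted({abs(x) for x in w if x != 0})
    for t in candidates:
        negatives = sum(1 for x in w if x <= -t)  # proxy for false selections
        positives = sum(1 for x in w if x >= t)   # features selected at level t
        if (1 + negatives) / max(1, positives) <= q:
            return t
    return float("inf")  # no threshold meets the target, so nothing is selected


def knockoff_select(w: Sequence[float], q: float) -> List[int]:
    """Indices of features whose statistic reaches the knockoff+ threshold."""
    tau = knockoff_plus_threshold(w, q)
    return [j for j, x in enumerate(w) if x >= tau]


# Hypothetical statistics for 10 features (illustrative values, not real data).
w_example = [5.0, 4.2, 3.9, 3.5, 3.1, 2.8, -0.4, 0.3, -0.2, 0.1]
selected = knockoff_select(w_example, q=0.2)
print(selected)  # [0, 1, 2, 3, 4, 5]
```

In this sketch the count of large negative statistics serves as an estimate of how many large positive statistics belong to irrelevant features, which is what lets the threshold target a chosen FDR level q. Any knockoff statistic, linear or model-based, could supply the values in `w`.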