ROOFS: RObust biOmarker Feature Selection

Feature selection (FS) is essential for biomarker discovery and clinical predictive modeling. Over the past decades, the methodological literature on FS has become rich and mature, offering a wide spectrum of algorithmic approaches. However, much of this methodological progress has not fully translated into applied biomedical research. Moreover, challenges inherent in biomedical data, such as high-dimensional feature spaces, small sample sizes, multicollinearity, and missing values, make FS non-trivial. To help bridge this gap between methodological development and practical application, we propose ROOFS (RObust biOmarker Feature Selection), a Python package available at https://gitlab.inria.fr/compo/roofs, designed to help researchers choose an FS method suited to their problem. ROOFS benchmarks multiple FS methods on the user's data and generates reports summarizing a comprehensive set of evaluation metrics: downstream predictive performance estimated with optimism correction, stability, robustness of individual features, and true positive and false positive rates assessed on semi-synthetic data with a simulated outcome. We demonstrate the utility of ROOFS on data from the PIONeeR clinical trial, which aimed to identify predictors of resistance to anti-PD-(L)1 immunotherapy in lung cancer. Of the 34 FS methods gathered in ROOFS, we evaluated 23 in combination with 11 classifiers (253 models) and identified a filter based on the union of Benjamini-Hochberg false discovery rate-adjusted p-values from the t-test and logistic regression as the optimal approach, outperforming other methods, including the widely used LASSO. We conclude that comprehensive benchmarking with ROOFS has the potential to improve the reproducibility of FS discoveries and increase the translational value of clinical models.
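To make the best-performing filter more concrete, here is a minimal sketch of a union filter over Benjamini-Hochberg adjusted p-values. It is not taken from ROOFS and does not reflect its API: the function names `bh_adjust` and `union_fdr_filter`, the `alpha=0.05` threshold, and the toy p-values are all assumptions for illustration. The per-feature p-values from the t-test and from univariate logistic regression are assumed to have been computed beforehand.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values
    # stay monotone in the rank (the BH step-up rule).
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def union_fdr_filter(names, p_ttest, p_logreg, alpha=0.05):
    """Keep a feature if either test passes the FDR threshold.

    names, p_ttest and p_logreg are parallel lists; alpha is a
    hypothetical threshold, not a value reported in the abstract.
    """
    adj_t = bh_adjust(p_ttest)
    adj_l = bh_adjust(p_logreg)
    return [
        name
        for name, a, b in zip(names, adj_t, adj_l)
        if a <= alpha or b <= alpha
    ]


# Toy p-values, invented for illustration only.
features = ["f1", "f2", "f3"]
selected = union_fdr_filter(
    features,
    p_ttest=[0.001, 0.5, 0.9],
    p_logreg=[0.8, 0.01, 0.7],
)
print(selected)
```

Each test's p-values are adjusted separately before taking the union, so a feature is retained when at least one of the two tests flags it. Whether ROOFS adjusts the two families separately or jointly is not stated in the abstract; the sketch assumes the former.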

