BOLIMES: Boruta and LIME optiMized fEature Selection for Gene Expression Classification

Gene expression classification is a pivotal yet challenging task in bioinformatics, primarily because genomic data are high-dimensional and models trained on them are prone to overfitting. To address these challenges, we propose BOLIMES, a novel feature selection algorithm designed to enhance gene expression classification by systematically refining the feature subset. Unlike conventional methods that rely solely on statistical ranking or classifier-specific selection, we integrate the robustness of Boruta with the interpretability of LIME, so that only the most relevant and influential genes are retained. BOLIMES first employs Boruta to filter out non-informative genes by comparing each feature against its randomized counterpart, which preserves the genes that carry valuable information. It then uses LIME to rank the remaining genes by their local importance to the classifier. Finally, an iterative classification evaluation determines the optimal feature subset by selecting the number of top-ranked genes that maximizes predictive accuracy. By combining exhaustive feature selection with interpretability-driven refinement, BOLIMES balances dimensionality reduction with high classification performance, offering a powerful approach to high-dimensional gene expression analysis.
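The three stages described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Boruta stage is replaced by a simplified shadow-feature filter with a majority vote (real Boruta uses a statistical test), the LIME stage by a hand-written proximity-weighted linear surrogate in the spirit of LIME, and the classifier, synthetic data, and all function names (`shadow_filter`, `local_importance`, `best_top_k`, `select_features`) are hypothetical choices made for the example. It assumes a binary classification task and that NumPy and scikit-learn are available.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score


def shadow_filter(X, y, n_rounds=10, seed=0):
    """Simplified Boruta-style filter (assumption: majority vote, no statistical test).
    Keeps features whose importance beats the best shuffled 'shadow' feature
    in more than half of the rounds."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for r in range(n_rounds):
        # Shadow features: each column permuted on its own, breaking its link with y.
        shadows = np.column_stack([rng.permutation(X[:, j]) for j in range(n_features)])
        forest = RandomForestClassifier(n_estimators=100, random_state=seed + r)
        forest.fit(np.hstack([X, shadows]), y)
        imp = forest.feature_importances_
        hits += imp[:n_features] > imp[n_features:].max()
    return np.flatnonzero(hits > n_rounds / 2)


def local_importance(model, X, n_instances=20, n_samples=200, seed=0):
    """LIME-style score (hand-written stand-in, binary classification only):
    fit a proximity-weighted linear surrogate around sampled instances and
    average the absolute coefficients per feature."""
    rng = np.random.default_rng(seed)
    scale = X.std(axis=0) + 1e-12
    rows = rng.choice(len(X), size=min(n_instances, len(X)), replace=False)
    total = np.zeros(X.shape[1])
    for i in rows:
        # Perturb the instance with Gaussian noise expressed in standardized units.
        noise = rng.normal(size=(n_samples, X.shape[1]))
        neighbours = X[i] + noise * scale
        target = model.predict_proba(neighbours)[:, 1]
        # Exponential kernel: perturbations closer to the instance weigh more.
        dist = np.linalg.norm(noise, axis=1)
        weights = np.exp(-(dist ** 2) / (0.75 ** 2 * X.shape[1]))
        surrogate = Ridge(alpha=1.0).fit(noise, target, sample_weight=weights)
        total += np.abs(surrogate.coef_)
    return total / len(rows)


def best_top_k(X, y, ranked, cv=5, seed=0):
    """Evaluate the top-k ranked features for every k and keep the most accurate k."""
    best_k, best_acc = 0, -1.0
    for k in range(1, len(ranked) + 1):
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        acc = cross_val_score(clf, X[:, ranked[:k]], y, cv=cv).mean()
        if acc > best_acc:
            best_k, best_acc = k, acc
    return best_k, best_acc


def select_features(X, y, seed=0):
    """Filter, rank, then pick the subset size with the best cross-validated accuracy."""
    kept = shadow_filter(X, y, seed=seed)
    if kept.size == 0:
        raise ValueError("no feature outperformed its shadow counterpart")
    model = RandomForestClassifier(n_estimators=100, random_state=seed).fit(X[:, kept], y)
    scores = local_importance(model, X[:, kept], seed=seed)
    ranked = kept[np.argsort(scores)[::-1]]  # most locally important first
    k, acc = best_top_k(X, y, ranked, seed=seed)
    return ranked[:k], acc


# Synthetic stand-in for an expression matrix: 200 samples, 30 "genes", 5 informative.
X, y = make_classification(n_samples=200, n_features=30, n_informative=5,
                           n_redundant=0, shuffle=False, random_state=0)
selected, accuracy = select_features(X, y)
print("selected feature indices:", selected)
print("cross-validated accuracy:", round(accuracy, 3))
```

Because the subset size is chosen on the same cross-validation folds that produce the reported accuracy, the printed figure is optimistic; an honest estimate would wrap the whole selection procedure in an outer held-out split.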



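Feature selection

Feature selection, also called feature subset selection (FSS) or attribute selection, is the process of choosing N features from an existing set of M features so that a given performance criterion of the system is optimized. By picking the most effective of the original features, it reduces the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are the key to training a model.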
