Gene expression classification is a pivotal yet challenging task in bioinformatics, primarily because genomic data are high-dimensional and classifiers trained on them are prone to overfitting. To address these challenges, we propose BOLIMES, a novel feature selection algorithm that improves gene expression classification by systematically refining the feature subset. Unlike conventional methods that rely solely on statistical ranking or classifier-specific selection, BOLIMES integrates the robustness of Boruta with the interpretability of LIME, so that only the most relevant and influential genes are retained. BOLIMES proceeds in three stages. First, it employs Boruta to filter out non-informative genes by comparing each feature against its randomized counterpart, which preserves the genes that carry useful information. Second, it uses LIME to rank the remaining genes by their local importance to the classifier. Third, an iterative classification evaluation determines the optimal feature subset by selecting the number of top-ranked genes that maximizes predictive accuracy. By combining exhaustive feature selection with interpretability-driven refinement, BOLIMES balances dimensionality reduction with high classification performance, offering an effective approach to high-dimensional gene expression analysis.
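The control flow of the first and third stages can be illustrated with a minimal, standard-library-only sketch. It is not the BOLIMES implementation: Boruta itself uses random-forest importances and a statistical test over many iterations, whereas this sketch swaps in a simple class-mean gap (`mean_gap`) and a majority vote over rounds. The LIME ranking of the second stage is not computed here and is assumed to arrive as an ordered list of gene names. All function names, gene names, and the toy accuracy table are hypothetical.

```python
import random


def mean_gap(column, labels):
    """Stand-in importance (assumption): absolute gap between the two class means."""
    pos = [v for v, y in zip(column, labels) if y == 1]
    neg = [v for v, y in zip(column, labels) if y == 0]
    return abs(sum(pos) / len(pos) - sum(neg) / len(neg))


def shadow_filter(columns, labels, importance=mean_gap, n_rounds=20, seed=0):
    """Boruta-style filter: keep genes that beat the best shuffled copy in most rounds."""
    rng = random.Random(seed)
    hits = {gene: 0 for gene in columns}
    for _ in range(n_rounds):
        shadow_scores = []
        for values in columns.values():
            shuffled = values[:]
            rng.shuffle(shuffled)  # shuffling breaks any link between gene and label
            shadow_scores.append(importance(shuffled, labels))
        threshold = max(shadow_scores)
        for gene, values in columns.items():
            if importance(values, labels) > threshold:
                hits[gene] += 1
    # Simplification: majority vote instead of Boruta's statistical test.
    return [gene for gene, count in hits.items() if count > n_rounds / 2]


def select_best_prefix(ranked_genes, evaluate):
    """Return the shortest top-k prefix of the ranking with the highest score."""
    best_genes, best_score = [], float("-inf")
    for k in range(1, len(ranked_genes) + 1):
        score = evaluate(ranked_genes[:k])
        if score > best_score:  # strict comparison: ties keep the smaller subset
            best_genes, best_score = ranked_genes[:k], score
    return best_genes, best_score


# Toy data: gene_a tracks the label, gene_b is unrelated to it.
labels = [0] * 10 + [1] * 10
columns = {
    "gene_a": [0.0] * 10 + [10.0] * 10,
    "gene_b": [0.0, 1.0] * 10,
}
kept = shadow_filter(columns, labels)

# Hypothetical ranking (as if produced by LIME) and a made-up accuracy per subset size.
ranking = ["g3", "g1", "g7", "g2"]
toy_accuracy = {1: 0.70, 2: 0.90, 3: 0.90, 4: 0.85}
best_genes, best_score = select_best_prefix(ranking, lambda genes: toy_accuracy[len(genes)])
```

In a real pipeline, `evaluate` would train and validate a classifier on the given genes, ideally with cross-validation. Choosing the subset size on the same data used to report accuracy would give an optimistic estimate, so a held-out set is advisable.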