Noise-Augmented Boruta: The Neural Network Perturbation Infusion with Boruta Feature Selection

As data grows both vertically (in volume) and horizontally (in dimensionality), the curse of dimensionality has become an increasingly practical burden. Feature selection, a key family of dimensionality reduction techniques, has advanced considerably in response. One such advance is the Boruta feature selection algorithm, which identifies meaningful features by comparing them against permuted copies of themselves, known as shadow features. However, the significance of a feature is shaped more by the overall characteristics of the data than by its intrinsic value, a view reflected in the conventional Boruta algorithm, where shadow features closely mimic the characteristics of the original ones. Building on this premise, this paper modifies the Boruta feature selection algorithm by adding noise to the shadow variables, drawing a parallel with the perturbation analysis framework used for artificial neural networks. In tests on four publicly available benchmark datasets, the proposed technique outperforms the classic Boruta algorithm, which suggests it can select features more accurately.
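The shadow-feature idea described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not say what kind of noise is added or how it is scaled, so zero-mean Gaussian noise proportional to each column's standard deviation (the hypothetical `noise_scale` parameter) is assumed here. Absolute Pearson correlation with the target also stands in for the random-forest importance that Boruta normally uses, and the final statistical test on the hit counts is omitted. All function names are hypothetical.

```python
import random
import statistics


def make_noisy_shadows(columns, noise_scale=0.1, rng=None):
    """Build one shadow per column: permute the values, then perturb them.

    Assumption: noise is zero-mean Gaussian with standard deviation
    noise_scale * std(column). noise_scale=0 gives plain permuted shadows.
    """
    rng = rng or random.Random(0)
    shadows = []
    for col in columns:
        shadow = list(col)
        rng.shuffle(shadow)  # classic Boruta step: break the link to the target
        sigma = noise_scale * statistics.pstdev(col)
        shadows.append([v + rng.gauss(0.0, sigma) for v in shadow])
    return shadows


def abs_correlation(xs, ys):
    """Stand-in importance score: absolute Pearson correlation."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0
    return abs(cov) / (vx * vy) ** 0.5


def count_hits(columns, target, n_rounds=50, noise_scale=0.1, seed=0):
    """Count the rounds in which each real feature beats the best shadow."""
    rng = random.Random(seed)
    # With a correlation score the real features' scores do not change
    # between rounds; a forest-based score would be recomputed each round.
    real_scores = [abs_correlation(col, target) for col in columns]
    hits = [0] * len(columns)
    for _ in range(n_rounds):
        shadows = make_noisy_shadows(columns, noise_scale, rng)
        threshold = max(abs_correlation(s, target) for s in shadows)
        for j, score in enumerate(real_scores):
            if score > threshold:
                hits[j] += 1
    return hits
```

A feature that collects many hits is a candidate to keep, while one that rarely beats the best shadow is a candidate to reject. Setting `noise_scale=0` reduces the sketch to shadows that are pure permutations, which makes it easy to compare the two variants on the same data.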


Feature Selection


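Feature selection, also called feature subset selection (FSS) or attribute selection, means choosing N features from an existing set of M features so that a specific metric of the system is optimized. It is the process of picking the most effective features from the original ones in order to reduce the dimensionality of the dataset. It is an important means of improving the performance of learning algorithms and a key data preprocessing step in pattern recognition. For a learning algorithm, good training samples are key to training the model.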
