As data grows both vertically (in volume) and horizontally (in dimensionality), the curse of dimensionality has become an increasingly pressing burden. Feature selection, a key family of dimensionality reduction techniques, has advanced considerably to address this challenge. One such advance is the Boruta feature selection algorithm, which identifies meaningful features by comparing them with permuted copies of themselves, known as shadow features. However, the significance of a feature is shaped more by the overall characteristics of the data than by its intrinsic value, a view reflected in the conventional Boruta algorithm, where shadow features closely mimic the characteristics of the original ones. Building on this premise, this paper introduces a modification of the Boruta feature selection algorithm that incorporates noise into the shadow variables. The modified method draws on the perturbation analysis framework used for artificial neural networks. Testing on four publicly available benchmark datasets showed that the proposed technique outperforms the classic Boruta algorithm, indicating its potential for more accurate feature selection.
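The idea of noise-perturbed shadow features can be illustrated with a minimal sketch. The abstract does not specify how the noise is generated, so the additive Gaussian noise, the `noise_scale` parameter, the function names, and the use of scikit-learn's random forest importances are all assumptions made for illustration, not the paper's implementation. The sketch also runs a single round only, whereas Boruta repeats such rounds and applies a statistical test to the accumulated hits.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def make_noisy_shadows(X, noise_scale=0.1, rng=None):
    """Permute each column independently, then add Gaussian noise.

    The additive Gaussian form and noise_scale are assumptions; the
    abstract only states that noise is incorporated into the shadows.
    """
    rng = np.random.default_rng(rng)
    # Independent per-column permutation breaks any link to the target.
    shadows = np.column_stack(
        [rng.permutation(X[:, j]) for j in range(X.shape[1])]
    )
    # Noise is scaled by each column's standard deviation (hypothetical choice).
    noise = rng.normal(0.0, 1.0, size=X.shape) * X.std(axis=0) * noise_scale
    return shadows + noise


def boruta_round(X, y, noise_scale=0.1, seed=0):
    """One Boruta-style round: flag features that beat the best shadow."""
    shadows = make_noisy_shadows(X, noise_scale, rng=seed)
    forest = RandomForestClassifier(n_estimators=200, random_state=seed)
    # Fit on real and shadow features side by side.
    forest.fit(np.hstack([X, shadows]), y)
    importances = forest.feature_importances_
    n_features = X.shape[1]
    # A real feature scores a "hit" if it exceeds the maximum shadow importance.
    return importances[:n_features] > importances[n_features:].max()
```

Setting `noise_scale=0.0` reduces the shadow construction to plain permutation, which is how the classic Boruta algorithm builds its shadow features, so the two variants can be compared by changing a single argument.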