We investigate the feature selection problem for generic machine learning (ML) models. We introduce a framework that selects features based on the predictions of the model. Rather than removing features from the dataset, our framework eliminates them during the selection process with a novel feature masking approach. This allows us to use the same ML model throughout feature selection, unlike other feature selection methods, which must retrain the ML model because the dataset has different dimensions at each iteration. We obtain the mask operator from the predictions of the ML model, which offers a comprehensive view of the feature subsets essential for the model's predictive performance. A variety of approaches exist in the feature selection literature. However, no study has introduced a training-free framework for a generic ML model that selects features by considering the importance of feature subsets as a whole, instead of focusing on individual features. We demonstrate significant performance improvements on real-life datasets under different settings, using LightGBM and a Multi-Layer Perceptron as our ML models. Additionally, we openly share the implementation code for our methods to encourage research and contributions in this area.
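To make the masking idea concrete, the following is a minimal sketch rather than the released implementation. It assumes that a masked feature is replaced by its training mean and that the mask is found by a greedy backward search on validation loss; the abstract specifies neither choice. The names `masked_loss` and `greedy_mask_selection` are hypothetical, and a least-squares linear model stands in for LightGBM or a Multi-Layer Perceptron. The point it illustrates is that the model is fitted once and only queried afterwards, because masking keeps the input dimensions fixed.

```python
import numpy as np


def masked_loss(predict, X, y, mask, fill):
    """MSE of a fixed model when features with mask == 0 are replaced by `fill`."""
    # Assumption: masked columns take a constant fill value (here the training mean).
    X_masked = np.where(mask.astype(bool), X, fill)
    return float(np.mean((predict(X_masked) - y) ** 2))


def greedy_mask_selection(predict, X_val, y_val, fill):
    """Mask one feature at a time for as long as validation loss does not increase."""
    mask = np.ones(X_val.shape[1], dtype=int)
    best = masked_loss(predict, X_val, y_val, mask, fill)
    while mask.sum() > 1:
        # Evaluate the already trained model with each remaining feature masked.
        trials = []
        for j in np.flatnonzero(mask):
            trial = mask.copy()
            trial[j] = 0
            trials.append((masked_loss(predict, X_val, y_val, trial, fill), int(j)))
        loss, j = min(trials)
        if loss > best:
            break  # every further masking step hurts the model
        mask[j] = 0
        best = loss
    return mask, best


# Synthetic data (illustrative only): features 0 and 1 drive the target,
# features 2 and 3 are pure noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=400)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

# Train once; this stand-in model is never refitted during selection.
A = np.hstack([X_train, np.ones((X_train.shape[0], 1))])
w, *_ = np.linalg.lstsq(A, y_train, rcond=None)


def predict(X_in):
    return X_in @ w[:-1] + w[-1]


fill = X_train.mean(axis=0)
baseline = masked_loss(predict, X_val, y_val, np.ones(4, dtype=int), fill)
mask, final_loss = greedy_mask_selection(predict, X_val, y_val, fill)
print("mask:", mask, "baseline:", baseline, "after masking:", final_loss)
```

Because `predict` is treated as a black box, the same loop would apply to any model that exposes a prediction function. Note that this greedy search scores one feature at a time, so it is only a simplification of the subset-level importance the framework aims to capture.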